canonical / kfp-operators

Kubeflow Pipelines Operators
Apache License 2.0

KFP Recurring Runs not working on Kubeflow 1.8 #517

Open alelucrod opened 5 months ago

alelucrod commented 5 months ago

Bug Description

I am experiencing an issue after a fresh installation of Kubeflow 1.8/stable (following the official guide). I can launch manual runs, and they execute successfully. However, recurring runs, whether they are Periodic or Cron, do not launch.

In contrast, if I install Kubeflow 1.7 with MicroK8s 1.24, recurring runs do work. Is anyone else experiencing the same issue?

To Reproduce

Fresh install (https://charmed-kubeflow.io/docs/get-started-with-charmed-kubeflow)

Environment

Ubuntu 23.10, MicroK8s 1.26, Charmed Kubeflow 1.8/stable

Relevant Log Output

All units working as expected

Additional Context

No response

syncronize-issues-to-jira[bot] commented 5 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5295.

This message was autogenerated

syncronize-issues-to-jira[bot] commented 2 weeks ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5884.

This message was autogenerated

mvlassis commented 2 weeks ago

After deploying Kubeflow 1.8/stable with Microk8s 1.26, I can confirm that periodic recurring runs will not be scheduled. After investigation, we found out that kfp-schedwf is reporting the following error:

2024-06-20T10:52:46.510Z [controller] W0620 10:52:46.504648      14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:46.510Z [controller] E0620 10:52:46.504675      14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:47.706Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.445816616s 200
2024-06-20T10:52:47.732Z [pebble] GET /v1/services 84.622µs 200
2024-06-20T10:52:47.854Z [controller] W0620 10:52:47.854466      14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:47.854Z [controller] E0620 10:52:47.854489      14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:49.478Z [controller] W0620 10:52:49.478066      14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:49.478Z [controller] E0620 10:52:49.478092      14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:50.570Z [pebble] GET /v1/services 57.824µs 200
2024-06-20T10:52:53.600Z [controller] W0620 10:52:53.600150      14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:53.600Z [controller] E0620 10:52:53.600192      14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:55.153Z [pebble] GET /v1/plan?format=yaml 121.998µs 200
2024-06-20T10:52:55.175Z [pebble] GET /v1/services 55.578µs 200
2024-06-20T10:52:57.715Z [pebble] GET /v1/services 49.722µs 200
2024-06-20T10:53:01.812Z [controller] W0620 10:53:01.812605      14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:53:01.812Z [controller] E0620 10:53:01.812726      14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:53:03.067Z [pebble] GET /v1/plan?format=yaml 103.287µs 200
2024-06-20T10:53:03.074Z [pebble] GET /v1/services 37.85µs 200
2024-06-20T10:53:05.112Z [pebble] GET /v1/services 40.338µs 200
2024-06-20T10:53:09.950Z [pebble] GET /v1/plan?format=yaml 144.632µs 200
2024-06-20T10:53:09.974Z [pebble] GET /v1/services 51.045µs 200
2024-06-20T10:53:12.323Z [pebble] GET /v1/services 44.12µs 200
2024-06-20T10:53:23.002Z [controller] W0620 10:53:23.000721      14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:53:23.002Z [controller] E0620 10:53:23.000751      14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:54:09.293Z [controller] W0620 10:54:09.293438      14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:54:09.293Z [controller] E0620 10:54:09.293659      14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:55:02.657Z [controller] W0620 10:55:02.656928      14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:55:02.657Z [controller] E0620 10:55:02.656960      14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
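
The excerpt above interleaves Pebble access lines with the same two controller messages repeated many times. As a reading aid, here is a minimal Python sketch that drops everything except one service's klog lines and counts each distinct message. The function name `summarize` and both regular expressions are assumptions written against the line shape shown in this excerpt, not part of any Juju or Pebble tooling.

```python
import re
from collections import Counter

# "<timestamp> [<service>] <message>", as in the excerpt above.
LINE_RE = re.compile(r"^(?P<ts>\S+) \[(?P<service>[^\]]+)\] (?P<msg>.*)$")
# klog prefix, e.g. "W0620 10:52:46.504648      14 reflector.go:324] "
KLOG_RE = re.compile(r"^(?P<level>[IWEF])\d{4} \S+\s+\d+ (?P<src>[^\]]+)\] ")


def summarize(lines, service="controller"):
    """Count distinct klog messages from one service, ignoring timestamps."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("service") != service:
            continue  # skip Pebble access logs and lines that do not parse
        k = KLOG_RE.match(m.group("msg"))
        if not k:
            continue  # not a klog-formatted line
        text = m.group("msg")[k.end():]
        counts[(k.group("level"), text)] += 1
    return counts
```

Run over the log above, this should leave just two entries, one warning and one error, both pointing at the failed list of workflows.argoproj.io.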

mvlassis commented 2 weeks ago

After removing and redeploying the kfp-schedwf charm, the scheduled runs still do not work, but the error disappears:

2024-06-21T11:08:50.146Z [pebble] HTTP API server listening on ":38813".
2024-06-21T11:08:50.146Z [pebble] Started daemon.
2024-06-21T11:09:12.032Z [pebble] GET /v1/plan?format=yaml 2.278608ms 200
2024-06-21T11:09:12.033Z [pebble] POST /v1/layers 185.549µs 200
2024-06-21T11:09:12.042Z [pebble] POST /v1/services 8.225642ms 202
2024-06-21T11:09:12.054Z [pebble] Service "controller" starting: /bin/controller --logtostderr=true --namespace={self.namespace}
2024-06-21T11:09:12.105Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Location: UTC"
2024-06-21T11:09:12.105Z [controller] W0621 11:09:12.105200      15 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-06-21T11:09:12.105Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Creating event broadcaster"
2024-06-21T11:09:12.106Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Setting up event handlers"
2024-06-21T11:09:12.106Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Starting ScheduledWorkflow controller"
2024-06-21T11:09:12.106Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Waiting for informer caches to sync"
2024-06-21T11:09:12.206Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Starting workers"
2024-06-21T11:09:12.206Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Started workers"
2024-06-21T11:09:12.206Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Wait for shut down"
2024-06-21T11:09:13.067Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.024139014s 200
2024-06-21T11:09:13.081Z [pebble] GET /v1/services 86.216µs 200
2024-06-21T11:09:16.649Z [pebble] GET /v1/services 32.189µs 200
2024-06-21T11:09:20.912Z [pebble] GET /v1/plan?format=yaml 98.649µs 200
2024-06-21T11:09:20.923Z [pebble] GET /v1/services 38.434µs 200
2024-06-21T11:09:24.588Z [pebble] GET /v1/services 39.909µs 200
2024-06-21T11:09:28.905Z [pebble] GET /v1/plan?format=yaml 120.975µs 200
2024-06-21T11:09:28.915Z [pebble] GET /v1/services 34.192µs 200
2024-06-21T11:09:32.505Z [pebble] GET /v1/services 52.596µs 200
2024-06-21T11:09:36.694Z [pebble] GET /v1/plan?format=yaml 115.939µs 200
2024-06-21T11:09:36.711Z [pebble] GET /v1/services 61.221µs 200
2024-06-21T11:09:40.229Z [pebble] GET /v1/services 36.619µs 200

I also checked that the charm is trusted, and that the CRDs and the RBAC manifests match those upstream.

After testing an upstream 1.8 Kubeflow deployment, I can confirm that scheduled runs do work there.

mvlassis commented 2 weeks ago

These are the logs from kfp-api; there seems to be an error caused by an empty namespace:

2024-06-21T12:26:23.424Z [apiserver] I0621 12:26:23.423747      74 error.go:278] Invalid input error: a recurring run cannot have an empty namespace in multi-user mode
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.NewInvalidInputError
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/common/util/error.go:185
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).canAccessJob
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/apiserver/server/job_server.go:421
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).listJobs
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/apiserver/server/job_server.go:167
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).ListRecurringRuns
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/apiserver/server/job_server.go:345
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler.func1
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1698
2024-06-21T12:26:23.424Z [apiserver] main.apiServerInterceptor
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/apiserver/interceptor.go:30
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1700
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).processUnaryRPC
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).handleStream
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).serveStreams.func1.2
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
2024-06-21T12:26:23.424Z [apiserver] runtime.goexit
2024-06-21T12:26:23.424Z [apiserver]    /snap/go/10584/src/runtime/asm_amd64.s:1650
2024-06-21T12:26:23.424Z [apiserver] Failed to list recurring runs due to authorization error. Check if you have permission to access namespace 
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.(*UserError).wrapf
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/common/util/error.go:266
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.Wrapf
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/common/util/error.go:337
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).listJobs
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/apiserver/server/job_server.go:169
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).ListRecurringRuns
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/apiserver/server/job_server.go:345
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler.func1
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1698
2024-06-21T12:26:23.424Z [apiserver] main.apiServerInterceptor
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/apiserver/interceptor.go:30
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1700
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).processUnaryRPC
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).handleStream
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).serveStreams.func1.2
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
2024-06-21T12:26:23.424Z [apiserver] runtime.goexit
2024-06-21T12:26:23.424Z [apiserver]    /snap/go/10584/src/runtime/asm_amd64.s:1650
2024-06-21T12:26:23.424Z [apiserver] Failed to list jobs
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.(*UserError).wrap
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/common/util/error.go:271
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.Wrap
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/common/util/error.go:350
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).ListRecurringRuns
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/apiserver/server/job_server.go:347
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler.func1
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1698
2024-06-21T12:26:23.424Z [apiserver] main.apiServerInterceptor
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/apiserver/interceptor.go:30
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1700
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).processUnaryRPC
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).handleStream
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).serveStreams.func1.2
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
2024-06-21T12:26:23.424Z [apiserver] runtime.goexit
2024-06-21T12:26:23.424Z [apiserver]    /snap/go/10584/src/runtime/asm_amd64.s:1650
2024-06-21T12:26:23.424Z [apiserver] /kubeflow.pipelines.backend.api.v2beta1.RecurringRunService/ListRecurringRuns call failed
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.(*UserError).wrapf
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/common/util/error.go:266
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.Wrapf
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/common/util/error.go:337
2024-06-21T12:26:23.424Z [apiserver] main.apiServerInterceptor
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/src/apiserver/interceptor.go:32
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler
2024-06-21T12:26:23.424Z [apiserver]    /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1700
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).processUnaryRPC
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).handleStream
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).serveStreams.func1.2
2024-06-21T12:26:23.424Z [apiserver]    /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
2024-06-21T12:26:23.424Z [apiserver] runtime.goexit
2024-06-21T12:26:23.424Z [apiserver]    /snap/go/10584/src/runtime/asm_amd64.s:1650

sombrafam commented 1 week ago

After testing an upstream 1.8 Kubeflow deployment, I can confirm that scheduled runs do work there.

Can you share the steps you used for the upstream deployment?

mvlassis commented 1 week ago
sombrafam commented 1 week ago

Hi @mvlassis, can you share more details about that procedure? Where can I get those manifests so I can try?

mvlassis commented 1 week ago

@sombrafam The procedure is as follows:

The recurring runs should then work.

eleblebici commented 1 week ago

Hi @sombrafam @mvlassis, I also tried the workaround mentioned in issue #352.

I redeployed kfp-schedwf with juju deploy kfp-schedwf --channel 1.7/stable --resource oci-image=gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha. This first removed the existing scheduledworkflow. I then created a new recurring run, and a new scheduledworkflow was created with it. With the old deployment the runs were never triggered. After the redeploy the runs are triggered, but they do not execute and fail with an error (resource failed to execute).

eleblebici commented 1 week ago

@mvlassis can you please share the scheduledworkflow.yaml file you used?

eleblebici commented 1 week ago

OK, I created it myself, and after deploying the manifests below the recurring runs started to work:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: ml-pipeline
    app.kubernetes.io/name: kubeflow-pipelines
    application-crd-id: kubeflow-pipelines
  name: ml-pipeline-scheduledworkflow
  namespace: kubeflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app: ml-pipeline-scheduledworkflow-role
    app.kubernetes.io/component: ml-pipeline
    app.kubernetes.io/name: kubeflow-pipelines
    application-crd-id: kubeflow-pipelines
  name: ml-pipeline-scheduledworkflow-role
  namespace: kubeflow
rules:
- apiGroups:
  - argoproj.io
  resources:
  - workflows
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete
- apiGroups:
  - kubeflow.org
  resources:
  - scheduledworkflows
  - scheduledworkflows/finalizers
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
---
aggregationRule:
  clusterRoleSelectors:
  - matchLabels:
      duck.knative.dev/addressable: "true"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: knative-eventing
    app.kubernetes.io/name: knative-eventing
    app.kubernetes.io/version: 1.10.1
    kustomize.component: knative
  name: addressable-resolver
rules: []
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: poddefaults
    app.kubernetes.io/component: poddefaults
    app.kubernetes.io/name: poddefaults
    kustomize.component: poddefaults
  name: admission-webhook-cluster-role
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - poddefaults
  verbs:
  - get
  - watch
  - list
  - update
  - create
  - patch
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: ml-pipeline
    app.kubernetes.io/name: kubeflow-pipelines
    application-crd-id: kubeflow-pipelines
  name: ml-pipeline-scheduledworkflow-role
rules:
- apiGroups:
  - argoproj.io
  resources:
  - workflows
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete
- apiGroups:
  - kubeflow.org
  resources:
  - scheduledworkflows
  - scheduledworkflows/finalizers
  verbs:
  - create
  - get
  - list
  - watch
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/component: ml-pipeline
    app.kubernetes.io/name: kubeflow-pipelines
    application-crd-id: kubeflow-pipelines
  name: ml-pipeline-scheduledworkflow-binding
  namespace: kubeflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ml-pipeline-scheduledworkflow-role
subjects:
- kind: ServiceAccount
  name: ml-pipeline-scheduledworkflow
  namespace: kubeflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: ml-pipeline
    app.kubernetes.io/name: kubeflow-pipelines
    application-crd-id: kubeflow-pipelines
  name: ml-pipeline-scheduledworkflow-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ml-pipeline-scheduledworkflow-role
subjects:
- kind: ServiceAccount
  name: ml-pipeline-scheduledworkflow
  namespace: kubeflow
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: ml-pipeline-scheduledworkflow
    app.kubernetes.io/component: ml-pipeline
    app.kubernetes.io/name: kubeflow-pipelines
    application-crd-id: kubeflow-pipelines
  name: ml-pipeline-scheduledworkflow
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: ml-pipeline-scheduledworkflow
      app.kubernetes.io/component: ml-pipeline
      app.kubernetes.io/name: kubeflow-pipelines
      application-crd-id: kubeflow-pipelines
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
      labels:
        app: ml-pipeline-scheduledworkflow
        app.kubernetes.io/component: ml-pipeline
        app.kubernetes.io/name: kubeflow-pipelines
        application-crd-id: kubeflow-pipelines
    spec:
      containers:
      - env:
        - name: NAMESPACE
          value: ""
        - name: CRON_SCHEDULE_TIMEZONE
          valueFrom:
            configMapKeyRef:
              key: cronScheduleTimezone
              name: pipeline-install-config
        image: gcr.io/ml-pipeline/scheduledworkflow:2.0.3
        imagePullPolicy: IfNotPresent
        name: ml-pipeline-scheduledworkflow
      serviceAccountName: ml-pipeline-scheduledworkflow
---
apiVersion: v1
data:
  ConMaxLifeTime: 120s
  DEFAULT_CACHE_STALENESS: ""
  MAXIMUM_CACHE_STALENESS: ""
  appName: pipeline
  appVersion: 2.0.3
  autoUpdatePipelineDefaultVersion: "true"
  bucketName: mlpipeline
  cacheDb: cachedb
  cacheImage: gcr.io/google-containers/busybox
  cacheNodeRestrictions: "false"
  cronScheduleTimezone: UTC
  dbHost: mysql
  dbPort: "3306"
  dbType: mysql
  defaultPipelineRoot: ""
  mlmdDb: metadb
  mysqlHost: mysql
  mysqlPort: "3306"
  pipelineDb: mlpipeline
  warning: |
    1. Do not use kubectl to edit this configmap, because some values are used
    during kustomize build. Instead, change the configmap and apply the entire
    kustomize manifests again.
    2. After updating the configmap, some deployments may need to be restarted
    until the changes take effect. A quick way to restart all deployments in a
    namespace: `kubectl rollout restart deployment -n <your-namespace>`.
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/component: ml-pipeline
    app.kubernetes.io/name: kubeflow-pipelines
    application-crd-id: kubeflow-pipelines
  name: pipeline-install-config
  namespace: kubeflow

mvlassis commented 1 week ago

On upstream Kubeflow 1.8, I edited the ml-pipeline-scheduledworkflow deployment, swapping the upstream image for the following rock: charmedkubeflow/scheduledworkflow:2.0.5-f6d0763. I can confirm that recurring runs do work with it.

sombrafam commented 1 week ago

Do you still need to apply the Manifests with this image?

mvlassis commented 1 week ago

I ran kubectl edit deployment -n kubeflow ml-pipeline-scheduledworkflow and then changed the value of image.

DnPlas commented 5 days ago

Hi folks,

I think I have found the issue. The command that kfp-schedwf was executing sets --namespace={self.model.name}, which is not correct: according to upstream, this value should be "".

After I applied the following patch, rebuilt and refreshed the charm, scheduled workflows started working:

$ git diff
diff --git a/charms/kfp-schedwf/src/components/pebble_component.py b/charms/kfp-schedwf/src/components/pebble_component.py
index f10351a..8dbd593 100644
--- a/charms/kfp-schedwf/src/components/pebble_component.py
+++ b/charms/kfp-schedwf/src/components/pebble_component.py
@@ -18,7 +18,7 @@ class KfpSchedwfPebbleService(PebbleServiceComponent):
     ):
         """Pebble service container component in order to configure Pebble layer"""
         super().__init__(*args, **kwargs)
-        self.environment = {"CRON_SCHEDULE_TIMEZONE": timezone}
+        self.environment = {"CRON_SCHEDULE_TIMEZONE": timezone, "NAMESPACE": ""}
         self.namespace = namespace

     def get_layer(self) -> Layer:
@@ -42,7 +42,7 @@ class KfpSchedwfPebbleService(PebbleServiceComponent):
                         "summary": "scheduled workflow controller service",
                         "startup": "enabled",
                         "command": "/bin/controller --logtostderr=true"
-                        " --namespace={self.namespace}",
+                        ' --namespace=""',
                         "environment": self.environment,
                     }
                 },

I am attaching an image that shows scheduled workflows running after applying the above patch:

image

@mvlassis let's apply this change in both track/2.0 and main, as it affects both branches. Let's also try to increase the integration test coverage (as much as we can) to avoid this issue in the future.

Please also note that our rock is using the right value, but since we are replacing that layer with the one in the charm, it was not used at all.
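
For readers following the patch, here is a minimal Python sketch of how the command string is assembled before and after the change. The class `SchedwfServiceSketch` and its method names are hypothetical stand-ins, not the charm's real `KfpSchedwfPebbleService`; only the string literals are taken from the diff. It also illustrates one observation: the Pebble log earlier in this thread shows the service starting with the literal text --namespace={self.namespace}, which is what Python produces when a placeholder sits in a plain string rather than an f-string.

```python
class SchedwfServiceSketch:
    """Hypothetical stand-in for the charm's Pebble service component."""

    def __init__(self, namespace: str):
        self.namespace = namespace

    def command_before_patch(self) -> str:
        # Adjacent plain string literals are concatenated as-is, so the
        # "{self.namespace}" placeholder is never interpolated.
        return "/bin/controller --logtostderr=true" " --namespace={self.namespace}"

    def command_if_fstring(self) -> str:
        # For comparison only: what an f-string would have produced.
        return "/bin/controller --logtostderr=true" f" --namespace={self.namespace}"

    def command_after_patch(self) -> str:
        # The patched command passes an explicit empty namespace.
        return "/bin/controller --logtostderr=true" ' --namespace=""'


svc = SchedwfServiceSketch("kubeflow")
print(svc.command_before_patch())
print(svc.command_if_fstring())
print(svc.command_after_patch())
```

Either pre-patch form scopes the controller to a single namespace value, whereas the patched form matches the upstream Deployment, which sets NAMESPACE to an empty string.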

kimwnasptd commented 4 days ago

Great work @DnPlas!

kimwnasptd commented 4 days ago

This would also explain why the workers started but no progress was made: the controller was only monitoring for Recurring Runs in the kubeflow namespace.
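
To make the scoping point concrete, here is a toy Python model of a namespace-scoped watch. It relies on the Kubernetes convention that an empty namespace means "all namespaces". The names `ScheduledWorkflow`, `watched`, and the `user-profile` namespace are made up for illustration; this is not the controller's code.

```python
from dataclasses import dataclass

NAMESPACE_ALL = ""  # Kubernetes convention: empty string selects every namespace


@dataclass(frozen=True)
class ScheduledWorkflow:
    name: str
    namespace: str


def watched(items, namespace):
    """Return the items a watcher scoped to `namespace` would see."""
    if namespace == NAMESPACE_ALL:
        return list(items)
    return [item for item in items if item.namespace == namespace]


# Hypothetical recurring run created in a user's profile namespace.
runs = [ScheduledWorkflow(name="nightly", namespace="user-profile")]

print(watched(runs, "kubeflow"))     # the scoped watcher sees nothing to schedule
print(watched(runs, NAMESPACE_ALL))  # the cluster-wide watcher sees the run
```

In this model the scoped watcher starts cleanly and simply never receives the object, which is consistent with a controller that logs "Started workers" and then does nothing.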

DnPlas commented 3 days ago

@alelucrod @eleblebici @sombrafam we have released the fix in the 2.0/stable channel. You should be able to refresh and verify that recurring runs are working now: juju refresh kfp-schedwf --channel 2.0/stable should give you revision 1466.

@mvlassis is still working on #529 to make this change in latest/edge.

sombrafam commented 2 days ago

@DnPlas @mvlassis Thanks for your work and assistance, guys.