Closed: alelucrod closed this issue 3 months ago.
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5295.
This message was autogenerated
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5884.
This message was autogenerated
After deploying Kubeflow 1.8/stable with MicroK8s 1.26, I can confirm that periodic recurring runs are never scheduled. After investigating, we found that kfp-schedwf reports the following error:
2024-06-20T10:52:46.510Z [controller] W0620 10:52:46.504648 14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:46.510Z [controller] E0620 10:52:46.504675 14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:47.706Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.445816616s 200
2024-06-20T10:52:47.732Z [pebble] GET /v1/services 84.622µs 200
2024-06-20T10:52:47.854Z [controller] W0620 10:52:47.854466 14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:47.854Z [controller] E0620 10:52:47.854489 14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:49.478Z [controller] W0620 10:52:49.478066 14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:49.478Z [controller] E0620 10:52:49.478092 14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:50.570Z [pebble] GET /v1/services 57.824µs 200
2024-06-20T10:52:53.600Z [controller] W0620 10:52:53.600150 14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:53.600Z [controller] E0620 10:52:53.600192 14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:52:55.153Z [pebble] GET /v1/plan?format=yaml 121.998µs 200
2024-06-20T10:52:55.175Z [pebble] GET /v1/services 55.578µs 200
2024-06-20T10:52:57.715Z [pebble] GET /v1/services 49.722µs 200
2024-06-20T10:53:01.812Z [controller] W0620 10:53:01.812605 14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:53:01.812Z [controller] E0620 10:53:01.812726 14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:53:03.067Z [pebble] GET /v1/plan?format=yaml 103.287µs 200
2024-06-20T10:53:03.074Z [pebble] GET /v1/services 37.85µs 200
2024-06-20T10:53:05.112Z [pebble] GET /v1/services 40.338µs 200
2024-06-20T10:53:09.950Z [pebble] GET /v1/plan?format=yaml 144.632µs 200
2024-06-20T10:53:09.974Z [pebble] GET /v1/services 51.045µs 200
2024-06-20T10:53:12.323Z [pebble] GET /v1/services 44.12µs 200
2024-06-20T10:53:23.002Z [controller] W0620 10:53:23.000721 14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:53:23.002Z [controller] E0620 10:53:23.000751 14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:54:09.293Z [controller] W0620 10:54:09.293438 14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:54:09.293Z [controller] E0620 10:54:09.293659 14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:55:02.657Z [controller] W0620 10:55:02.656928 14 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
2024-06-20T10:55:02.657Z [controller] E0620 10:55:02.656960 14 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.3/tools/cache/reflector.go:167: Failed to watch *v1alpha1.Workflow: failed to list *v1alpha1.Workflow: the server could not find the requested resource (get workflows.argoproj.io)
After removing and redeploying the kfp-schedwf charm, the scheduled runs still do not work, but the error disappears:
2024-06-21T11:08:50.146Z [pebble] HTTP API server listening on ":38813".
2024-06-21T11:08:50.146Z [pebble] Started daemon.
2024-06-21T11:09:12.032Z [pebble] GET /v1/plan?format=yaml 2.278608ms 200
2024-06-21T11:09:12.033Z [pebble] POST /v1/layers 185.549µs 200
2024-06-21T11:09:12.042Z [pebble] POST /v1/services 8.225642ms 202
2024-06-21T11:09:12.054Z [pebble] Service "controller" starting: /bin/controller --logtostderr=true --namespace={self.namespace}
2024-06-21T11:09:12.105Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Location: UTC"
2024-06-21T11:09:12.105Z [controller] W0621 11:09:12.105200 15 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2024-06-21T11:09:12.105Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Creating event broadcaster"
2024-06-21T11:09:12.106Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Setting up event handlers"
2024-06-21T11:09:12.106Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Starting ScheduledWorkflow controller"
2024-06-21T11:09:12.106Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Waiting for informer caches to sync"
2024-06-21T11:09:12.206Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Starting workers"
2024-06-21T11:09:12.206Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Started workers"
2024-06-21T11:09:12.206Z [controller] time="2024-06-21T11:09:12Z" level=info msg="Wait for shut down"
2024-06-21T11:09:13.067Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.024139014s 200
2024-06-21T11:09:13.081Z [pebble] GET /v1/services 86.216µs 200
2024-06-21T11:09:16.649Z [pebble] GET /v1/services 32.189µs 200
2024-06-21T11:09:20.912Z [pebble] GET /v1/plan?format=yaml 98.649µs 200
2024-06-21T11:09:20.923Z [pebble] GET /v1/services 38.434µs 200
2024-06-21T11:09:24.588Z [pebble] GET /v1/services 39.909µs 200
2024-06-21T11:09:28.905Z [pebble] GET /v1/plan?format=yaml 120.975µs 200
2024-06-21T11:09:28.915Z [pebble] GET /v1/services 34.192µs 200
2024-06-21T11:09:32.505Z [pebble] GET /v1/services 52.596µs 200
2024-06-21T11:09:36.694Z [pebble] GET /v1/plan?format=yaml 115.939µs 200
2024-06-21T11:09:36.711Z [pebble] GET /v1/services 61.221µs 200
2024-06-21T11:09:40.229Z [pebble] GET /v1/services 36.619µs 200
I also checked that the charm is trusted, and that the CRDs and RBAC manifests match those upstream.
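For reference, a sketch of those checks as commands; the application and namespace names assume a default Charmed Kubeflow deployment on MicroK8s, so adjust as needed:
# Ensure the charm has cluster-wide permissions:
$ juju trust kfp-schedwf --scope=cluster
# Verify the CRDs the controller lists/watches are present:
$ microk8s kubectl get crd workflows.argoproj.io scheduledworkflows.kubeflow.org
# Compare the charm's RBAC objects against upstream (the grep pattern is a guess):
$ microk8s kubectl get role,rolebinding -n kubeflow | grep -i schedwf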
After testing an upstream 1.8 Kubeflow deployment, I can confirm that scheduled runs do work there.
These are the logs from kfp-api; there seems to be an error about an empty namespace:
2024-06-21T12:26:23.424Z [apiserver] I0621 12:26:23.423747 74 error.go:278] Invalid input error: a recurring run cannot have an empty namespace in multi-user mode
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.NewInvalidInputError
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/common/util/error.go:185
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).canAccessJob
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/apiserver/server/job_server.go:421
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).listJobs
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/apiserver/server/job_server.go:167
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).ListRecurringRuns
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/apiserver/server/job_server.go:345
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler.func1
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1698
2024-06-21T12:26:23.424Z [apiserver] main.apiServerInterceptor
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/apiserver/interceptor.go:30
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1700
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).processUnaryRPC
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).handleStream
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).serveStreams.func1.2
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
2024-06-21T12:26:23.424Z [apiserver] runtime.goexit
2024-06-21T12:26:23.424Z [apiserver] /snap/go/10584/src/runtime/asm_amd64.s:1650
2024-06-21T12:26:23.424Z [apiserver] Failed to list recurring runs due to authorization error. Check if you have permission to access namespace
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.(*UserError).wrapf
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/common/util/error.go:266
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.Wrapf
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/common/util/error.go:337
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).listJobs
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/apiserver/server/job_server.go:169
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).ListRecurringRuns
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/apiserver/server/job_server.go:345
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler.func1
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1698
2024-06-21T12:26:23.424Z [apiserver] main.apiServerInterceptor
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/apiserver/interceptor.go:30
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1700
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).processUnaryRPC
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).handleStream
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).serveStreams.func1.2
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
2024-06-21T12:26:23.424Z [apiserver] runtime.goexit
2024-06-21T12:26:23.424Z [apiserver] /snap/go/10584/src/runtime/asm_amd64.s:1650
2024-06-21T12:26:23.424Z [apiserver] Failed to list jobs
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.(*UserError).wrap
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/common/util/error.go:271
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.Wrap
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/common/util/error.go:350
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/apiserver/server.(*JobServer).ListRecurringRuns
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/apiserver/server/job_server.go:347
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler.func1
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1698
2024-06-21T12:26:23.424Z [apiserver] main.apiServerInterceptor
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/apiserver/interceptor.go:30
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1700
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).processUnaryRPC
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).handleStream
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).serveStreams.func1.2
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
2024-06-21T12:26:23.424Z [apiserver] runtime.goexit
2024-06-21T12:26:23.424Z [apiserver] /snap/go/10584/src/runtime/asm_amd64.s:1650
2024-06-21T12:26:23.424Z [apiserver] /kubeflow.pipelines.backend.api.v2beta1.RecurringRunService/ListRecurringRuns call failed
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.(*UserError).wrapf
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/common/util/error.go:266
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/src/common/util.Wrapf
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/common/util/error.go:337
2024-06-21T12:26:23.424Z [apiserver] main.apiServerInterceptor
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/src/apiserver/interceptor.go:32
2024-06-21T12:26:23.424Z [apiserver] github.com/kubeflow/pipelines/backend/api/v2beta1/go_client._RecurringRunService_ListRecurringRuns_Handler
2024-06-21T12:26:23.424Z [apiserver] /root/parts/builder/build/backend/api/v2beta1/go_client/recurring_run.pb.go:1700
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).processUnaryRPC
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1282
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).handleStream
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:1616
2024-06-21T12:26:23.424Z [apiserver] google.golang.org/grpc.(*Server).serveStreams.func1.2
2024-06-21T12:26:23.424Z [apiserver] /root/go/pkg/mod/google.golang.org/grpc@v1.44.0/server.go:921
2024-06-21T12:26:23.424Z [apiserver] runtime.goexit
2024-06-21T12:26:23.424Z [apiserver] /snap/go/10584/src/runtime/asm_amd64.s:1650
Can you share the steps you used to deploy the upstream deployment where scheduled runs work?
I kubectl apply all manifests from the upstream deployment that are related to ml-pipeline-scheduledworkflow, and the recurring runs work. I also tested the kfp-schedwf charm with the image taken directly from upstream (the v1.8 branch): gcr.io/ml-pipeline/scheduledworkflow:2.0.5. The recurring runs then do not work.
Hi @mvlassis, can you share more details about that procedure? Where can I get those manifests so I can try?
@sombrafam The procedure is as follows:
1. Build the upstream manifests: kustomize build example > manifests-1-8.yaml
2. Copy the resources related to ml-pipeline-scheduledworkflow into a separate scheduledworkflow.yaml file.
3. Apply scheduledworkflow.yaml with kubectl.
The recurring runs should then work; see the sketch after this list.
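A minimal sketch of those steps as shell commands, assuming the upstream kubeflow/manifests repository and its v1.8 branch naming (both assumptions; adjust to your checkout):
$ git clone --branch v1.8-branch https://github.com/kubeflow/manifests.git
$ cd manifests
$ kustomize build example > manifests-1-8.yaml
# Manually copy every resource named after ml-pipeline-scheduledworkflow
# (ServiceAccount, Role/ClusterRole, bindings, Deployment) plus the
# pipeline-install-config ConfigMap it references into scheduledworkflow.yaml.
$ kubectl apply -f scheduledworkflow.yaml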
Hi @sombrafam @mvlassis, I also tried the workaround mentioned in issue #352, redeploying kfp-schedwf with juju deploy kfp-schedwf --channel 1.7/stable --resource oci-image=gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha (see the command sketch below).
This first removed the current scheduledworkflow. When I created a new recurring run, a new scheduledworkflow was also created. With the older one the runs were not triggered; after redeploying, the runs were triggered, but they do not run and fail with an error (resource failed to execute).
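For completeness, the workaround as a command sequence; the channel and image come from the comment above, and anything else in your model (relations, config) may need restoring afterwards:
$ juju remove-application kfp-schedwf
$ juju deploy kfp-schedwf --channel 1.7/stable \
    --resource oci-image=gcr.io/ml-pipeline/scheduledworkflow:2.0.0-alpha
$ juju status kfp-schedwf   # wait for the unit to settle into active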
@mvlassis can you please share the scheduledworkflow.yaml file you used?
OK, I created it myself, and after applying the manifests below, the recurring runs started to work:
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
app.kubernetes.io/component: ml-pipeline
app.kubernetes.io/name: kubeflow-pipelines
application-crd-id: kubeflow-pipelines
name: ml-pipeline-scheduledworkflow
namespace: kubeflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
labels:
app: ml-pipeline-scheduledworkflow-role
app.kubernetes.io/component: ml-pipeline
app.kubernetes.io/name: kubeflow-pipelines
application-crd-id: kubeflow-pipelines
name: ml-pipeline-scheduledworkflow-role
namespace: kubeflow
rules:
- apiGroups:
- argoproj.io
resources:
- workflows
verbs:
- create
- get
- list
- watch
- update
- patch
- delete
- apiGroups:
- kubeflow.org
resources:
- scheduledworkflows
- scheduledworkflows/finalizers
verbs:
- create
- get
- list
- watch
- update
- patch
- delete
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
---
aggregationRule:
clusterRoleSelectors:
- matchLabels:
duck.knative.dev/addressable: "true"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/component: knative-eventing
app.kubernetes.io/name: knative-eventing
app.kubernetes.io/version: 1.10.1
kustomize.component: knative
name: addressable-resolver
rules: []
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app: poddefaults
app.kubernetes.io/component: poddefaults
app.kubernetes.io/name: poddefaults
kustomize.component: poddefaults
name: admission-webhook-cluster-role
rules:
- apiGroups:
- kubeflow.org
resources:
- poddefaults
verbs:
- get
- watch
- list
- update
- create
- patch
- delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/component: ml-pipeline
app.kubernetes.io/name: kubeflow-pipelines
application-crd-id: kubeflow-pipelines
name: ml-pipeline-scheduledworkflow-role
rules:
- apiGroups:
- argoproj.io
resources:
- workflows
verbs:
- create
- get
- list
- watch
- update
- patch
- delete
- apiGroups:
- kubeflow.org
resources:
- scheduledworkflows
- scheduledworkflows/finalizers
verbs:
- create
- get
- list
- watch
- update
- patch
- delete
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
labels:
app.kubernetes.io/component: ml-pipeline
app.kubernetes.io/name: kubeflow-pipelines
application-crd-id: kubeflow-pipelines
name: ml-pipeline-scheduledworkflow-binding
namespace: kubeflow
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: ml-pipeline-scheduledworkflow-role
subjects:
- kind: ServiceAccount
name: ml-pipeline-scheduledworkflow
namespace: kubeflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
app.kubernetes.io/component: ml-pipeline
app.kubernetes.io/name: kubeflow-pipelines
application-crd-id: kubeflow-pipelines
name: ml-pipeline-scheduledworkflow-binding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: ml-pipeline-scheduledworkflow-role
subjects:
- kind: ServiceAccount
name: ml-pipeline-scheduledworkflow
namespace: kubeflow
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: ml-pipeline-scheduledworkflow
app.kubernetes.io/component: ml-pipeline
app.kubernetes.io/name: kubeflow-pipelines
application-crd-id: kubeflow-pipelines
name: ml-pipeline-scheduledworkflow
namespace: kubeflow
spec:
selector:
matchLabels:
app: ml-pipeline-scheduledworkflow
app.kubernetes.io/component: ml-pipeline
app.kubernetes.io/name: kubeflow-pipelines
application-crd-id: kubeflow-pipelines
template:
metadata:
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
labels:
app: ml-pipeline-scheduledworkflow
app.kubernetes.io/component: ml-pipeline
app.kubernetes.io/name: kubeflow-pipelines
application-crd-id: kubeflow-pipelines
spec:
containers:
- env:
- name: NAMESPACE
value: ""
- name: CRON_SCHEDULE_TIMEZONE
valueFrom:
configMapKeyRef:
key: cronScheduleTimezone
name: pipeline-install-config
image: gcr.io/ml-pipeline/scheduledworkflow:2.0.3
imagePullPolicy: IfNotPresent
name: ml-pipeline-scheduledworkflow
serviceAccountName: ml-pipeline-scheduledworkflow
---
apiVersion: v1
data:
ConMaxLifeTime: 120s
DEFAULT_CACHE_STALENESS: ""
MAXIMUM_CACHE_STALENESS: ""
appName: pipeline
appVersion: 2.0.3
autoUpdatePipelineDefaultVersion: "true"
bucketName: mlpipeline
cacheDb: cachedb
cacheImage: gcr.io/google-containers/busybox
cacheNodeRestrictions: "false"
cronScheduleTimezone: UTC
dbHost: mysql
dbPort: "3306"
dbType: mysql
defaultPipelineRoot: ""
mlmdDb: metadb
mysqlHost: mysql
mysqlPort: "3306"
pipelineDb: mlpipeline
warning: |
1. Do not use kubectl to edit this configmap, because some values are used
during kustomize build. Instead, change the configmap and apply the entire
kustomize manifests again.
2. After updating the configmap, some deployments may need to be restarted
until the changes take effect. A quick way to restart all deployments in a
namespace: `kubectl rollout restart deployment -n <your-namespace>`.
kind: ConfigMap
metadata:
labels:
app.kubernetes.io/component: ml-pipeline
app.kubernetes.io/name: kubeflow-pipelines
application-crd-id: kubeflow-pipelines
name: pipeline-install-config
namespace: kubeflow
On upstream Kubeflow v1.8, I edited the ml-pipeline-scheduledworkflow deployment, swapping the upstream image for the following rock: charmedkubeflow/scheduledworkflow:2.0.5-f6d0763, and I can confirm that recurring runs do work.
Do you still need to apply the manifests with this image?
I ran the following command: kubectl edit deployment -n kubeflow ml-pipeline-scheduledworkflow, and then changed the value of the image field.
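An equivalent non-interactive way to make the same change; the container name matches the Deployment manifest shared above:
$ kubectl -n kubeflow set image deployment/ml-pipeline-scheduledworkflow \
    ml-pipeline-scheduledworkflow=charmedkubeflow/scheduledworkflow:2.0.5-f6d0763
$ kubectl -n kubeflow rollout status deployment/ml-pipeline-scheduledworkflow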
Hi folks,
I think I have found the issue. The command that the kfp-schedwf charm executes sets --namespace={self.namespace} (intended to resolve to the model name), which is not correct: according to upstream, this value should be "". Note also that the command string is not an f-string, so the literal text {self.namespace} was passed to the controller, as the Pebble service log above shows.
After I applied the following patch, rebuilt and refreshed the charm, scheduled workflows started working:
$ git diff
diff --git a/charms/kfp-schedwf/src/components/pebble_component.py b/charms/kfp-schedwf/src/components/pebble_component.py
index f10351a..8dbd593 100644
--- a/charms/kfp-schedwf/src/components/pebble_component.py
+++ b/charms/kfp-schedwf/src/components/pebble_component.py
@@ -18,7 +18,7 @@ class KfpSchedwfPebbleService(PebbleServiceComponent):
):
"""Pebble service container component in order to configure Pebble layer"""
super().__init__(*args, **kwargs)
- self.environment = {"CRON_SCHEDULE_TIMEZONE": timezone}
+ self.environment = {"CRON_SCHEDULE_TIMEZONE": timezone, "NAMESPACE": ""}
self.namespace = namespace
def get_layer(self) -> Layer:
@@ -42,7 +42,7 @@ class KfpSchedwfPebbleService(PebbleServiceComponent):
"summary": "scheduled workflow controller service",
"startup": "enabled",
"command": "/bin/controller --logtostderr=true"
- " --namespace={self.namespace}",
+ ' --namespace=""',
"environment": self.environment,
}
},
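After refreshing the rebuilt charm, one way to confirm the corrected command is the Pebble service start line in the workload logs; the pod and container names below are assumptions for a default deployment, so check yours first:
$ kubectl -n kubeflow logs kfp-schedwf-0 -c ml-pipeline-scheduledworkflow | grep 'Service "controller" starting'
# Expected after the patch:
#   Service "controller" starting: /bin/controller --logtostderr=true --namespace=""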
I am attaching an image that shows scheduled workflows running after applying the above patch.
@mvlassis let's apply this change in both track/2.0 and main, as it is affecting both branches. Let's also try to increase the integration test coverage (as much as we can) to catch this kind of issue in the future.
Please also note that our rock uses the right value, but since we replace that layer with the one in the charm, it was not used at all.
Great work @DnPlas!
And this can explain why the workers started but no progress was made: the controller was only monitoring recurring runs in the kubeflow namespace.
@alelucrod @eleblebici @sombrafam we have released the fix in the 2.0/stable channel; you should be able to refresh and verify that recurring runs are working now. You should get revision 1466 with juju refresh kfp-schedwf --channel 2.0/stable (sketched below).
@mvlassis is still working on #529 to make this change in latest/edge.
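A quick sketch of picking up and verifying the fix; the expected revision comes from the comment above:
$ juju refresh kfp-schedwf --channel 2.0/stable
$ juju status kfp-schedwf   # expect charm revision 1466 and the unit active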
@DnPlas @mvlassis Thanks for your work and assistance, guys.
Bug Description
I am experiencing an issue after a fresh installation of Kubeflow 1.8/stable (following the official guide). I can launch manual runs, and they execute successfully. However, recurring runs, whether they are Periodic or Cron, do not launch.
In contrast, if I install Kubeflow 1.7 with MicroK8s 1.24, recurring runs do work. Is anyone else experiencing the same issue?
To Reproduce
Fresh install (https://charmed-kubeflow.io/docs/get-started-with-charmed-kubeflow)
Environment
Ubuntu 23.10
MicroK8s 1.26
Charmed Kubeflow 1.8/stable
Relevant Log Output
Additional Context
No response