canonical / bundle-kubeflow

Charmed Kubeflow

Cannot deploy bundle with kfp-api 1.7/stable rev 866 #735

Closed Barteus closed 10 months ago

Barteus commented 10 months ago

Bug Description

When deploying the bundle from the "bundle-kubeflow" repository, the Juju deployment fails at kfp-persistence, which is waiting on the kfp-api health check.

To Reproduce

  1. Download the bundle from the repository
  2. juju deploy ./bundle.yaml
  3. juju status -> the deployment never settles

Environment

Relevant Log Output

$ kubectl logs kfp-api-0 -n kubeflow
2023-10-27T07:14:21.608Z [container-agent] 2023-10-27 07:14:21 ERROR juju-log Failed update status with error: Workload failed health check
2023-10-27T07:14:22.011Z [container-agent] 2023-10-27 07:14:22 INFO juju.worker.uniter.operation runhook.go:159 ran "update-status" hook (via hook dispatching script: dispatch)
2023-10-27T07:17:49.237Z [container-agent] 2023-10-27 07:17:49 INFO juju-log HTTP Request: GET https://10.152.183.1/api/v1/namespaces/kubeflow/services/kfp-api "HTTP/1.1 200 OK"
2023-10-27T07:17:59.451Z [container-agent] 2023-10-27 07:17:59 INFO juju-log HTTP Request: PATCH https://10.152.183.1/api/v1/namespaces/kubeflow/services/kfp-api "HTTP/1.1 200 OK"
2023-10-27T07:17:59.530Z [container-agent] 2023-10-27 07:17:59 INFO juju-log Kubernetes service 'kfp-api' patched successfully
2023-10-27T07:18:01.322Z [container-agent] 2023-10-27 07:18:01 INFO juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
2023-10-27T07:18:01.592Z [container-agent] 2023-10-27 07:18:01 INFO juju-log Rendering manifests
2023-10-27T07:18:11.855Z [container-agent] 2023-10-27 07:18:11 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kfp-api?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
2023-10-27T07:18:22.049Z [container-agent] 2023-10-27 07:18:22 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/kfp-api?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
2023-10-27T07:18:32.270Z [container-agent] 2023-10-27 07:18:32 INFO juju-log HTTP Request: PATCH https://10.152.183.1/api/v1/namespaces/kubeflow/services/ml-pipeline?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"
2023-10-27T07:18:32.360Z [container-agent] 2023-10-27 07:18:32 INFO juju-log Reconcile completed successfully
2023-10-27T07:18:32.660Z [container-agent] 2023-10-27 07:18:32 ERROR juju-log Container apiserver failed health check. It will be restarted.
2023-10-27T07:18:32.770Z [container-agent] 2023-10-27 07:18:32 ERROR juju-log Failed update status with error: Workload failed health check
2023-10-27T07:18:33.156Z [container-agent] 2023-10-27 07:18:33 INFO juju.worker.uniter.operation runhook.go:159 ran "update-status" hook (via hook dispatching script: dispatch)
2023-10-27T07:23:34.347Z [container-agent] 2023-10-27 07:23:34 INFO juju-log HTTP Request: GET https://10.152.183.1/api/v1/namespaces/kubeflow/services/kfp-api "HTTP/1.1 200 OK"
2023-10-27T07:23:44.521Z [container-agent] 2023-10-27 07:23:44 INFO juju-log HTTP Request: PATCH https://10.152.183.1/api/v1/namespaces/kubeflow/services/kfp-api "HTTP/1.1 200 OK"
2023-10-27T07:23:44.603Z [container-agent] 2023-10-27 07:23:44 INFO juju-log Kubernetes service 'kfp-api' patched successfully
2023-10-27T07:23:46.378Z [container-agent] 2023-10-27 07:23:46 INFO juju-log HTTP Request: GET https://10.152.183.1/apis/apiextensions.k8s.io/v1/customresourcedefinitions "HTTP/1.1 200 OK"
2023-10-27T07:23:46.634Z [container-agent] 2023-10-27 07:23:46 INFO juju-log Rendering manifests
2023-10-27T07:23:56.930Z [container-agent] 2023-10-27 07:23:56 INFO juju-log HTTP Request: PATCH https://10.152.183.1/apis/rbac.authorization.k8s.io/v1/clusterroles/kfp-api?force=true&fieldManager=lightkube "HTTP/1.1 200 OK"

Additional Context

No response

DnPlas commented 10 months ago

Hey @Barteus, just to understand this issue better: after deploying CKF using the 1.7/stable bundle definition, does kfp-api go into an error state? Is it the only charm that is failing?

To help us debug, could you please share the output of 'juju status kfp-api' as well as the logs from the apiserver container (kubectl logs -nkubeflow kfp-api-0 -c apiserver)? Can you also check whether kfp-db is active?

gustavosr98 commented 10 months ago

Hey @Barteus, I deployed Kubeflow 1.7/stable yesterday. I have the same kfp-api rev 866, and I am not seeing the error.

The only difference from your environment is that I deployed on MicroK8s.

Juju Status

juju status
Model     Controller          Cloud/Region        Version  SLA          Timestamp
kubeflow  microk8s-localhost  microk8s/localhost  2.9.45   unsupported  14:33:48Z

App                        Version                  Status  Scale  Charm                    Channel         Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b    active      1  admission-webhook        1.7/stable      224  10.152.183.8    no       
argo-controller            res:oci-image@3902c16    active      1  argo-controller          3.3/stable      376                  no       
argo-server                res:oci-image@e2292c9    active      1  argo-server              3.3/stable      309                  no       
dex-auth                                            active      1  dex-auth                 2.31/stable     346  10.152.183.6    no       
istio-ingressgateway                                active      1  istio-gateway            1.16/stable     663  10.152.183.108  no       
istio-pilot                                         active      1  istio-pilot              1.16/stable     662  10.152.183.161  no       
jupyter-controller         res:oci-image@1167186    active      1  jupyter-controller       1.7/stable      805                  no       
jupyter-ui                                          active      1  jupyter-ui               1.7/stable      727  10.152.183.187  no       
katib-controller           res:oci-image@111495a    active      1  katib-controller         0.15/stable     282  10.152.183.188  no       
katib-db                   8.0.34-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       99  10.152.183.237  no       
katib-db-manager                                    active      1  katib-db-manager         0.15/stable     253  10.152.183.147  no       
katib-ui                                            active      1  katib-ui                 0.15/stable     267  10.152.183.243  no       
kfp-api                                             active      1  kfp-api                  2.0/stable      866  10.152.183.122  no       
kfp-db                     8.0.34-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       99  10.152.183.7    no       
kfp-persistence            res:oci-image@ebed770    active      1  kfp-persistence          2.0/stable      870                  no       
kfp-profile-controller     res:oci-image@aa75b0c    active      1  kfp-profile-controller   2.0/stable      831  10.152.183.143  no       
kfp-schedwf                res:oci-image@2cb9087    active      1  kfp-schedwf              2.0/stable      932                  no       
kfp-ui                     res:oci-image@ae72602    active      1  kfp-ui                   2.0/stable      865  10.152.183.56   no       
kfp-viewer                 res:oci-image@899e25f    active      1  kfp-viewer               2.0/stable      895                  no       
kfp-viz                    res:oci-image@ffaf37e    active      1  kfp-viz                  2.0/stable      822  10.152.183.229  no       
knative-eventing                                    active      1  knative-eventing         1.8/stable      345  10.152.183.139  no       
knative-operator                                    active      1  knative-operator         1.8/stable      320  10.152.183.31   no       
knative-serving                                     active      1  knative-serving          1.8/stable      346  10.152.183.36   no       
kserve-controller                                   active      1  kserve-controller        0.10/stable     394  10.152.183.184  no       
kubeflow-dashboard                                  active      1  kubeflow-dashboard       1.7/stable      439  10.152.183.159  no       
kubeflow-profiles                                   active      1  kubeflow-profiles        1.7/stable      336  10.152.183.216  no       
kubeflow-roles                                      active      1  kubeflow-roles           1.7/stable      148  10.152.183.49   no       
kubeflow-volumes           res:oci-image@d261609    active      1  kubeflow-volumes         1.7/stable      204  10.152.183.252  no       
metacontroller-operator                             active      1  metacontroller-operator  2.0/stable      204  10.152.183.220  no       
minio                      res:oci-image@1755999    active      1  minio                    ckf-1.7/stable  214  10.152.183.197  no       
mlflow-minio               res:oci-image@1755999    active      1  minio                    ckf-1.7/stable  214  10.152.183.230  no       
mlflow-mysql               8.0.34-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       99  10.152.183.144  no       
mlflow-server                                       active      1  mlflow-server            2.1/stable      466  10.152.183.200  no       
oidc-gatekeeper            res:oci-image@6b720b8    active      1  oidc-gatekeeper          ckf-1.7/stable  269  10.152.183.198  no       
seldon-controller-manager                           active      1  seldon-core              1.15/stable     548  10.152.183.135  no       
tensorboard-controller     res:oci-image@c52f7c2    active      1  tensorboard-controller   1.7/stable      156  10.152.183.52   no       
tensorboards-web-app       res:oci-image@929f55b    active      1  tensorboards-web-app     1.7/stable      158  10.152.183.162  no       
training-operator                                   active      1  training-operator        1.6/stable      305  10.152.183.221  no       

Unit                          Workload  Agent  Address       Ports              Message
admission-webhook/0*          active    idle   10.1.134.142  4443/TCP           
argo-controller/0*            active    idle   10.1.134.204                     
argo-server/0*                active    idle   10.1.134.144  2746/TCP           
dex-auth/0*                   active    idle   10.1.134.141                     
istio-ingressgateway/0*       active    idle   10.1.134.143                     
istio-pilot/0*                active    idle   10.1.134.146                     
jupyter-controller/0*         active    idle   10.1.134.148                     
jupyter-ui/0*                 active    idle   10.1.134.150                     
katib-controller/0*           active    idle   10.1.134.153  443/TCP,8080/TCP   
katib-db-manager/0*           active    idle   10.1.134.155                     
katib-db/0*                   active    idle   10.1.134.154                     Primary
katib-ui/0*                   active    idle   10.1.134.156                     
kfp-api/0*                    active    idle   10.1.134.157                     
kfp-db/0*                     active    idle   10.1.134.159                     Primary
kfp-persistence/0*            active    idle   10.1.134.205                     
kfp-profile-controller/0*     active    idle   10.1.134.203  80/TCP             
kfp-schedwf/0*                active    idle   10.1.134.192                     
kfp-ui/0*                     active    idle   10.1.134.206  3000/TCP           
kfp-viewer/0*                 active    idle   10.1.134.134                     
kfp-viz/0*                    active    idle   10.1.134.158  8888/TCP           
knative-eventing/0*           active    idle   10.1.134.160                     
knative-operator/0*           active    idle   10.1.134.165                     
knative-serving/0*            active    idle   10.1.134.161                     
kserve-controller/0*          active    idle   10.1.134.166                     
kubeflow-dashboard/0*         active    idle   10.1.134.164                     
kubeflow-profiles/0*          active    idle   10.1.134.168                     
kubeflow-roles/0*             active    idle   10.1.134.162                     
kubeflow-volumes/0*           active    idle   10.1.134.199  5000/TCP           
metacontroller-operator/0*    active    idle   10.1.134.163                     
minio/0*                      active    idle   10.1.134.202  9000/TCP,9001/TCP  
mlflow-minio/0*               active    idle   10.1.134.213  9000/TCP,9001/TCP  
mlflow-mysql/0*               active    idle   10.1.134.210                     Primary
mlflow-server/0*              active    idle   10.1.134.211                     
oidc-gatekeeper/0*            active    idle   10.1.134.207  8080/TCP           
seldon-controller-manager/0*  active    idle   10.1.134.167                     
tensorboard-controller/0*     active    idle   10.1.134.201  9443/TCP           
tensorboards-web-app/0*       active    idle   10.1.134.198  5000/TCP           
training-operator/0*          active    idle   10.1.134.169

Logs

# ubuntu@ip-172-31-65-245:~$ microk8s.kubectl logs kfp-api-0 -n kubeflow | grep -i error | grep health
# Empty

# $ microk8s.kubectl logs kfp-api-0 -n kubeflow | grep -i error
Defaulted container "charm" out of: charm, apiserver, charm-init (init)
2023-10-26T20:25:01.736Z [container-agent] 2023-10-26 20:25:01 ERROR juju-log Failed to handle <LeaderElectedEvent via KfpApiOperator/on/leader_elected[31]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:25:03.021Z [container-agent] 2023-10-26 20:25:03 ERROR juju-log Failed to handle <ConfigChangedEvent via KfpApiOperator/on/config_changed[36]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:25:10.708Z [container-agent] 2023-10-26 20:25:10 ERROR juju-log Failed to handle <PebbleReadyEvent via KfpApiOperator/on/apiserver_pebble_ready[46]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:25:11.935Z [container-agent] 2023-10-26 20:25:11 ERROR juju-log relational-db:20: Failed to handle <RelationJoinedEvent via KfpApiOperator/on/relational_db_relation_joined[51]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:25:13.175Z [container-agent] 2023-10-26 20:25:13 ERROR juju-log relational-db:20: Failed to handle <RelationChangedEvent via KfpApiOperator/on/relational_db_relation_changed[56]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:26:01.256Z [container-agent] 2023-10-26 20:26:01 ERROR juju-log relational-db:20: Failed to handle <RelationChangedEvent via KfpApiOperator/on/relational_db_relation_changed[61]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:26:01.867Z [container-agent] 2023-10-26 20:26:01 ERROR juju-log relational-db:20: Failed to handle <DatabaseCreatedEvent via KfpApiOperator/DatabaseRequires[relational-db]/on/database_created[62]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:27:29.166Z [container-agent] 2023-10-26 20:27:29 ERROR juju-log kfp-viz:23: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_viz_relation_changed[72]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:27:30.622Z [container-agent] 2023-10-26 20:27:30 ERROR juju-log kfp-viz:23: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_viz_relation_changed[77]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:27:37.687Z [container-agent] 2023-10-26 20:27:37 ERROR juju-log kfp-viz:23: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_viz_relation_changed[82]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:27:42.149Z [container-agent] 2023-10-26 20:27:42 ERROR juju-log kfp-api:21: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_api_relation_changed[92]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:27:44.614Z [container-agent] 2023-10-26 20:27:44 ERROR juju-log kfp-api:21: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_api_relation_changed[97]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:28:13.456Z [container-agent] 2023-10-26 20:28:13 ERROR juju-log kfp-api:22: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_api_relation_changed[107]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:28:56.110Z [container-agent] 2023-10-26 20:28:56 ERROR juju-log object-storage:24: Failed to handle <RelationChangedEvent via KfpApiOperator/on/object_storage_relation_changed[117]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:29:08.317Z [container-agent] 2023-10-26 20:29:08 ERROR juju-log Failed to handle <UpdateStatusEvent via KfpApiOperator/on/update_status[122]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:29:45.562Z [container-agent] 2023-10-26 20:29:45 ERROR juju-log object-storage:24: Failed to generate container configuration.
2023-10-26T20:29:45.668Z [container-agent] 2023-10-26 20:29:45 ERROR juju-log object-storage:24: Failed to handle <RelationChangedEvent via KfpApiOperator/on/object_storage_relation_changed[127]> with error: Waiting for kfp-viz relation data
2023-10-26T20:29:49.275Z [container-agent] 2023-10-26 20:29:49 ERROR juju-log kfp-api:22: Failed to generate container configuration.
2023-10-26T20:29:49.312Z [container-agent] 2023-10-26 20:29:49 ERROR juju-log kfp-api:22: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_api_relation_changed[132]> with error: Waiting for kfp-viz relation data

natalytvinova commented 10 months ago

Hi @DnPlas, I'm working on the same deployment as @Barteus. Here is the juju status (please note that this is revision 856, but the error is the same):

ubuntu@infra-1-medma:~$ juju status kfp-api
Model     Controller        Cloud/Region  Version  SLA          Timestamp
kubeflow  foundations-maas  ck8s/default  2.9.44   unsupported  14:36:56Z

App      Version  Status   Scale  Charm    Channel     Rev  Address         Exposed  Message
kfp-api           waiting      1  kfp-api  2.0/stable  856  10.152.183.237  no       waiting for units to settle down

Unit        Workload     Agent  Address         Ports  Message
kfp-api/0*  maintenance  idle   192.168.226.33         Workload failed health check

Container logs:

$ kubectl logs -n kubeflow kfp-api-0 -c apiserver
2023-10-27T14:35:58.178Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-10-27T14:35:58.178Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~30s before restart (backoff 410)
2023-10-27T14:36:28.558Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-10-27T14:36:29.740Z [apiserver] I1027 14:36:29.740162    5119 client_manager.go:160] Initializing client manager
2023-10-27T14:36:29.740Z [apiserver] I1027 14:36:29.740332    5119 config.go:57] Config DBConfig.ExtraParams not specified, skipping
2023-10-27T14:36:49.760Z [apiserver] F1027 14:36:49.760077    5119 error.go:337] dial tcp: lookup kfp-db-primary.kubeflow.svc.cluster.local: Temporary failure in name resolution
2023-10-27T14:36:49.785Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-10-27T14:36:49.785Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~30s before restart (backoff 411)
2023-10-27T14:37:21.414Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-10-27T14:37:22.595Z [apiserver] I1027 14:37:22.595638    5132 client_manager.go:160] Initializing client manager
2023-10-27T14:37:22.595Z [apiserver] I1027 14:37:22.595780    5132 config.go:57] Config DBConfig.ExtraParams not specified, skipping
2023-10-27T14:37:42.612Z [apiserver] F1027 14:37:42.612223    5132 error.go:337] dial tcp: lookup kfp-db-primary.kubeflow.svc.cluster.local: Temporary failure in name resolution
2023-10-27T14:37:42.632Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-10-27T14:37:42.632Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~30s before restart (backoff 412)
2023-10-27T14:38:14.281Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-10-27T14:38:15.484Z [apiserver] I1027 14:38:15.483903    5144 client_manager.go:160] Initializing client manager
2023-10-27T14:38:15.484Z [apiserver] I1027 14:38:15.484023    5144 config.go:57] Config DBConfig.ExtraParams not specified, skipping

DnPlas commented 10 months ago

Thanks for the logs @natalytvinova. I will also need the status of the other kfp-* charms and minio, if you can provide them. The state of the kfp-db charm (juju status kfp-db) and the kubectl logs of its pod would be especially useful.

I also see 2023-10-27T14:37:42.612Z [apiserver] F1027 14:37:42.612223 5132 error.go:337] dial tcp: lookup kfp-db-primary.kubeflow.svc.cluster.local: Temporary failure in name resolution

Could you please confirm that there is a charm called kfp-db-primary and that the service exists (kubectl get svc kfp-db-primary -nkubeflow)?
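
The check requested above could be scripted roughly as follows. This is only a sketch: svc_fqdn and check_svc_dns are hypothetical helper names, the dns-test pod name and the busybox:1.36 image are assumptions (any image that ships nslookup would do), and the cluster.local domain is taken from the log line quoted above.

```shell
# Build the in-cluster DNS name of a Service.
# The cluster.local suffix matches the name seen in the apiserver log.
svc_fqdn() {
  local svc="$1" ns="$2"
  printf '%s.%s.svc.cluster.local\n' "$svc" "$ns"
}

# Hypothetical helper: confirm the Service exists, then try to resolve
# its name from a throwaway pod in the same namespace.
check_svc_dns() {
  local svc="$1" ns="$2"
  kubectl get svc "$svc" -n "$ns" || return 1
  # Pod name and image are assumptions; the pod is removed when it exits.
  kubectl run dns-test -n "$ns" --rm -i --restart=Never --image=busybox:1.36 \
    -- nslookup "$(svc_fqdn "$svc" "$ns")"
}

# Example (requires cluster access):
#   check_svc_dns kfp-db-primary kubeflow
```

Note that the throwaway pod may be scheduled on a different node than kfp-api-0, so a successful lookup here does not by itself rule out a DNS problem that only affects some nodes.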

natalytvinova commented 10 months ago

@DnPlas, with the new 1.7 bundle and the k8s network issues fixed on our side, we no longer face this bug.

Barteus commented 10 months ago

The issue was the availability of a DNS server between the nodes. Thank you for the help!

Closing the issue.
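
For anyone hitting a similar "Workload failed health check" status, a quick way to tell whether the apiserver container is failing on name resolution is to count the matching log lines. A minimal sketch, assuming the error text is the same as in the logs in this thread; count_dns_failures is a hypothetical helper name.

```shell
# Count log lines (read from stdin) that report a DNS resolution failure.
# grep -c exits non-zero when nothing matches, so "|| true" keeps the
# function usable under "set -e" while still printing 0.
count_dns_failures() {
  grep -c 'Temporary failure in name resolution' || true
}

# Example (requires cluster access):
#   kubectl logs -n kubeflow kfp-api-0 -c apiserver | count_dns_failures
```

A non-zero count points at cluster DNS or networking rather than at the kfp-api charm itself.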