Closed: Barteus closed this issue 10 months ago
Hey @Barteus, just to understand this issue better: after deploying CKF using the 1.7/stable bundle definition, does kfp-api go into an error state? Is it the only charm that is failing?
To help us debug better, could you please share the output of 'juju status kfp-api' as well as the logs from the apiserver container (kubectl logs -nkubeflow kfp-api-0 -c apiserver)? Can you also check if the kfp-db is active?
Hey @Barteus, I deployed Kubeflow 1.7/stable yesterday.
I have the same kfp-api revision (866).
I don't seem to be seeing the error.
The only difference between my environment and yours is that I deployed on MicroK8s.
Juju Status
juju status
Model     Controller          Cloud/Region        Version  SLA          Timestamp
kubeflow  microk8s-localhost  microk8s/localhost  2.9.45   unsupported  14:33:48Z

App                        Version                  Status  Scale  Charm                    Channel         Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b    active  1      admission-webhook        1.7/stable      224  10.152.183.8    no
argo-controller            res:oci-image@3902c16    active  1      argo-controller          3.3/stable      376                  no
argo-server                res:oci-image@e2292c9    active  1      argo-server              3.3/stable      309                  no
dex-auth                                            active  1      dex-auth                 2.31/stable     346  10.152.183.6    no
istio-ingressgateway                                active  1      istio-gateway            1.16/stable     663  10.152.183.108  no
istio-pilot                                         active  1      istio-pilot              1.16/stable     662  10.152.183.161  no
jupyter-controller         res:oci-image@1167186    active  1      jupyter-controller       1.7/stable      805                  no
jupyter-ui                                          active  1      jupyter-ui               1.7/stable      727  10.152.183.187  no
katib-controller           res:oci-image@111495a    active  1      katib-controller         0.15/stable     282  10.152.183.188  no
katib-db                   8.0.34-0ubuntu0.22.04.1  active  1      mysql-k8s                8.0/stable      99   10.152.183.237  no
katib-db-manager                                    active  1      katib-db-manager         0.15/stable     253  10.152.183.147  no
katib-ui                                            active  1      katib-ui                 0.15/stable     267  10.152.183.243  no
kfp-api                                             active  1      kfp-api                  2.0/stable      866  10.152.183.122  no
kfp-db                     8.0.34-0ubuntu0.22.04.1  active  1      mysql-k8s                8.0/stable      99   10.152.183.7    no
kfp-persistence            res:oci-image@ebed770    active  1      kfp-persistence          2.0/stable      870                  no
kfp-profile-controller     res:oci-image@aa75b0c    active  1      kfp-profile-controller   2.0/stable      831  10.152.183.143  no
kfp-schedwf                res:oci-image@2cb9087    active  1      kfp-schedwf              2.0/stable      932                  no
kfp-ui                     res:oci-image@ae72602    active  1      kfp-ui                   2.0/stable      865  10.152.183.56   no
kfp-viewer                 res:oci-image@899e25f    active  1      kfp-viewer               2.0/stable      895                  no
kfp-viz                    res:oci-image@ffaf37e    active  1      kfp-viz                  2.0/stable      822  10.152.183.229  no
knative-eventing                                    active  1      knative-eventing         1.8/stable      345  10.152.183.139  no
knative-operator                                    active  1      knative-operator         1.8/stable      320  10.152.183.31   no
knative-serving                                     active  1      knative-serving          1.8/stable      346  10.152.183.36   no
kserve-controller                                   active  1      kserve-controller        0.10/stable     394  10.152.183.184  no
kubeflow-dashboard                                  active  1      kubeflow-dashboard       1.7/stable      439  10.152.183.159  no
kubeflow-profiles                                   active  1      kubeflow-profiles        1.7/stable      336  10.152.183.216  no
kubeflow-roles                                      active  1      kubeflow-roles           1.7/stable      148  10.152.183.49   no
kubeflow-volumes           res:oci-image@d261609    active  1      kubeflow-volumes         1.7/stable      204  10.152.183.252  no
metacontroller-operator                             active  1      metacontroller-operator  2.0/stable      204  10.152.183.220  no
minio                      res:oci-image@1755999    active  1      minio                    ckf-1.7/stable  214  10.152.183.197  no
mlflow-minio               res:oci-image@1755999    active  1      minio                    ckf-1.7/stable  214  10.152.183.230  no
mlflow-mysql               8.0.34-0ubuntu0.22.04.1  active  1      mysql-k8s                8.0/stable      99   10.152.183.144  no
mlflow-server                                       active  1      mlflow-server            2.1/stable      466  10.152.183.200  no
oidc-gatekeeper            res:oci-image@6b720b8    active  1      oidc-gatekeeper          ckf-1.7/stable  269  10.152.183.198  no
seldon-controller-manager                           active  1      seldon-core              1.15/stable     548  10.152.183.135  no
tensorboard-controller     res:oci-image@c52f7c2    active  1      tensorboard-controller   1.7/stable      156  10.152.183.52   no
tensorboards-web-app       res:oci-image@929f55b    active  1      tensorboards-web-app     1.7/stable      158  10.152.183.162  no
training-operator                                   active  1      training-operator        1.6/stable      305  10.152.183.221  no

Unit                          Workload  Agent  Address       Ports              Message
admission-webhook/0*          active    idle   10.1.134.142  4443/TCP
argo-controller/0*            active    idle   10.1.134.204
argo-server/0*                active    idle   10.1.134.144  2746/TCP
dex-auth/0*                   active    idle   10.1.134.141
istio-ingressgateway/0*       active    idle   10.1.134.143
istio-pilot/0*                active    idle   10.1.134.146
jupyter-controller/0*         active    idle   10.1.134.148
jupyter-ui/0*                 active    idle   10.1.134.150
katib-controller/0*           active    idle   10.1.134.153  443/TCP,8080/TCP
katib-db-manager/0*           active    idle   10.1.134.155
katib-db/0*                   active    idle   10.1.134.154                     Primary
katib-ui/0*                   active    idle   10.1.134.156
kfp-api/0*                    active    idle   10.1.134.157
kfp-db/0*                     active    idle   10.1.134.159                     Primary
kfp-persistence/0*            active    idle   10.1.134.205
kfp-profile-controller/0*     active    idle   10.1.134.203  80/TCP
kfp-schedwf/0*                active    idle   10.1.134.192
kfp-ui/0*                     active    idle   10.1.134.206  3000/TCP
kfp-viewer/0*                 active    idle   10.1.134.134
kfp-viz/0*                    active    idle   10.1.134.158  8888/TCP
knative-eventing/0*           active    idle   10.1.134.160
knative-operator/0*           active    idle   10.1.134.165
knative-serving/0*            active    idle   10.1.134.161
kserve-controller/0*          active    idle   10.1.134.166
kubeflow-dashboard/0*         active    idle   10.1.134.164
kubeflow-profiles/0*          active    idle   10.1.134.168
kubeflow-roles/0*             active    idle   10.1.134.162
kubeflow-volumes/0*           active    idle   10.1.134.199  5000/TCP
metacontroller-operator/0*    active    idle   10.1.134.163
minio/0*                      active    idle   10.1.134.202  9000/TCP,9001/TCP
mlflow-minio/0*               active    idle   10.1.134.213  9000/TCP,9001/TCP
mlflow-mysql/0*               active    idle   10.1.134.210                     Primary
mlflow-server/0*              active    idle   10.1.134.211
oidc-gatekeeper/0*            active    idle   10.1.134.207  8080/TCP
seldon-controller-manager/0*  active    idle   10.1.134.167
tensorboard-controller/0*     active    idle   10.1.134.201  9443/TCP
tensorboards-web-app/0*       active    idle   10.1.134.198  5000/TCP
training-operator/0*          active    idle   10.1.134.169
Logs
# ubuntu@ip-172-31-65-245:~$ microk8s.kubectl logs kfp-api-0 -n kubeflow | grep -i error | grep health
# Empty
# $ microk8s.kubectl logs kfp-api-0 -n kubeflow | grep -i error
Defaulted container "charm" out of: charm, apiserver, charm-init (init)
2023-10-26T20:25:01.736Z [container-agent] 2023-10-26 20:25:01 ERROR juju-log Failed to handle <LeaderElectedEvent via KfpApiOperator/on/leader_elected[31]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:25:03.021Z [container-agent] 2023-10-26 20:25:03 ERROR juju-log Failed to handle <ConfigChangedEvent via KfpApiOperator/on/config_changed[36]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:25:10.708Z [container-agent] 2023-10-26 20:25:10 ERROR juju-log Failed to handle <PebbleReadyEvent via KfpApiOperator/on/apiserver_pebble_ready[46]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:25:11.935Z [container-agent] 2023-10-26 20:25:11 ERROR juju-log relational-db:20: Failed to handle <RelationJoinedEvent via KfpApiOperator/on/relational_db_relation_joined[51]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:25:13.175Z [container-agent] 2023-10-26 20:25:13 ERROR juju-log relational-db:20: Failed to handle <RelationChangedEvent via KfpApiOperator/on/relational_db_relation_changed[56]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:26:01.256Z [container-agent] 2023-10-26 20:26:01 ERROR juju-log relational-db:20: Failed to handle <RelationChangedEvent via KfpApiOperator/on/relational_db_relation_changed[61]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:26:01.867Z [container-agent] 2023-10-26 20:26:01 ERROR juju-log relational-db:20: Failed to handle <DatabaseCreatedEvent via KfpApiOperator/DatabaseRequires[relational-db]/on/database_created[62]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:27:29.166Z [container-agent] 2023-10-26 20:27:29 ERROR juju-log kfp-viz:23: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_viz_relation_changed[72]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:27:30.622Z [container-agent] 2023-10-26 20:27:30 ERROR juju-log kfp-viz:23: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_viz_relation_changed[77]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:27:37.687Z [container-agent] 2023-10-26 20:27:37 ERROR juju-log kfp-viz:23: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_viz_relation_changed[82]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:27:42.149Z [container-agent] 2023-10-26 20:27:42 ERROR juju-log kfp-api:21: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_api_relation_changed[92]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:27:44.614Z [container-agent] 2023-10-26 20:27:44 ERROR juju-log kfp-api:21: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_api_relation_changed[97]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:28:13.456Z [container-agent] 2023-10-26 20:28:13 ERROR juju-log kfp-api:22: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_api_relation_changed[107]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:28:56.110Z [container-agent] 2023-10-26 20:28:56 ERROR juju-log object-storage:24: Failed to handle <RelationChangedEvent via KfpApiOperator/on/object_storage_relation_changed[117]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:29:08.317Z [container-agent] 2023-10-26 20:29:08 ERROR juju-log Failed to handle <UpdateStatusEvent via KfpApiOperator/on/update_status[122]> with error: List of <ops.model.Relation object-storage:24> versions not found for apps: minio
2023-10-26T20:29:45.562Z [container-agent] 2023-10-26 20:29:45 ERROR juju-log object-storage:24: Failed to generate container configuration.
2023-10-26T20:29:45.668Z [container-agent] 2023-10-26 20:29:45 ERROR juju-log object-storage:24: Failed to handle <RelationChangedEvent via KfpApiOperator/on/object_storage_relation_changed[127]> with error: Waiting for kfp-viz relation data
2023-10-26T20:29:49.275Z [container-agent] 2023-10-26 20:29:49 ERROR juju-log kfp-api:22: Failed to generate container configuration.
2023-10-26T20:29:49.312Z [container-agent] 2023-10-26 20:29:49 ERROR juju-log kfp-api:22: Failed to handle <RelationChangedEvent via KfpApiOperator/on/kfp_api_relation_changed[132]> with error: Waiting for kfp-viz relation data
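For anyone skimming long charm logs like the ones above, here is a minimal sketch that groups the juju-log ERROR lines by the message following "with error:". It is not part of the charm or of Juju; the function name summarize_errors and the regular expression are assumptions based only on the log shape shown in this thread.

```python
import re
from collections import Counter

# Assumed shape (taken from the log lines in this thread):
#   "... ERROR juju-log [relation:] Failed to handle <Event> with error: <message>"
ERROR_RE = re.compile(r"ERROR juju-log .*? with error: (?P<msg>.+)$")


def summarize_errors(lines):
    """Count how often each 'with error:' message appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line.rstrip("\n"))
        if match:
            counts[match.group("msg").strip()] += 1
    return counts
```

Applied to the output above, this would probably make it easier to see that nearly every failure is the same "versions not found for apps: minio" message, followed later by "Waiting for kfp-viz relation data". Lines without a "with error:" part, such as "Failed to generate container configuration.", are ignored by this sketch.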
Hi @DnPlas, I'm working on the same deployment as @Barteus. Here is the juju status (please note that this is revision 856, but the error is the same):
ubuntu@infra-1-medma:~$ juju status kfp-api
Model     Controller        Cloud/Region  Version  SLA          Timestamp
kubeflow  foundations-maas  ck8s/default  2.9.44   unsupported  14:36:56Z

App      Version  Status   Scale  Charm    Channel     Rev  Address         Exposed  Message
kfp-api           waiting  1      kfp-api  2.0/stable  856  10.152.183.237  no       waiting for units to settle down

Unit        Workload     Agent  Address         Ports  Message
kfp-api/0*  maintenance  idle   192.168.226.33         Workload failed health check
Container logs:
$ kubectl logs -n kubeflow kfp-api-0 -c apiserver
2023-10-27T14:35:58.178Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-10-27T14:35:58.178Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~30s before restart (backoff 410)
2023-10-27T14:36:28.558Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-10-27T14:36:29.740Z [apiserver] I1027 14:36:29.740162 5119 client_manager.go:160] Initializing client manager
2023-10-27T14:36:29.740Z [apiserver] I1027 14:36:29.740332 5119 config.go:57] Config DBConfig.ExtraParams not specified, skipping
2023-10-27T14:36:49.760Z [apiserver] F1027 14:36:49.760077 5119 error.go:337] dial tcp: lookup kfp-db-primary.kubeflow.svc.cluster.local: Temporary failure in name resolution
2023-10-27T14:36:49.785Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-10-27T14:36:49.785Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~30s before restart (backoff 411)
2023-10-27T14:37:21.414Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-10-27T14:37:22.595Z [apiserver] I1027 14:37:22.595638 5132 client_manager.go:160] Initializing client manager
2023-10-27T14:37:22.595Z [apiserver] I1027 14:37:22.595780 5132 config.go:57] Config DBConfig.ExtraParams not specified, skipping
2023-10-27T14:37:42.612Z [apiserver] F1027 14:37:42.612223 5132 error.go:337] dial tcp: lookup kfp-db-primary.kubeflow.svc.cluster.local: Temporary failure in name resolution
2023-10-27T14:37:42.632Z [pebble] Service "apiserver" stopped unexpectedly with code 255
2023-10-27T14:37:42.632Z [pebble] Service "apiserver" on-failure action is "restart", waiting ~30s before restart (backoff 412)
2023-10-27T14:38:14.281Z [pebble] Service "apiserver" starting: bash -c 'sleep 1.1 && /bin/apiserver --config=/config --sampleconfig=/config/sample_config.json -logtostderr=true '
2023-10-27T14:38:15.484Z [apiserver] I1027 14:38:15.483903 5144 client_manager.go:160] Initializing client manager
2023-10-27T14:38:15.484Z [apiserver] I1027 14:38:15.484023 5144 config.go:57] Config DBConfig.ExtraParams not specified, skipping
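The restart loop above is easier to read once the fatal lines are pulled out. As a rough sketch (the helper name failed_lookups and the pattern are assumptions matched only against the "lookup ...: Temporary failure in name resolution" text in this log), the host names that failed to resolve can be collected like this:

```python
import re

# Assumed pattern, based on the fatal apiserver lines quoted in this thread:
#   "dial tcp: lookup <host>: Temporary failure in name resolution"
LOOKUP_RE = re.compile(
    r"lookup (?P<host>\S+): Temporary failure in name resolution"
)


def failed_lookups(lines):
    """Return the sorted, de-duplicated host names whose DNS lookup failed."""
    hosts = set()
    for line in lines:
        match = LOOKUP_RE.search(line)
        if match:
            hosts.add(match.group("host"))
    return sorted(hosts)
```

Fed the apiserver log above, this returns the single name kfp-db-primary.kubeflow.svc.cluster.local, which suggests the crashes come from DNS resolution rather than from the database rejecting the connection.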
Thanks for the logs, @natalytvinova. I will also need the status of the other kfp-* charms and minio, if you can provide them. In particular, the state of the kfp-db charm (juju status kfp-db) and the kubectl logs of its pod would be very useful.
I also see this in the logs: 2023-10-27T14:37:42.612Z [apiserver] F1027 14:37:42.612223 5132 error.go:337] dial tcp: lookup kfp-db-primary.kubeflow.svc.cluster.local: Temporary failure in name resolution
Could you please confirm that there is a charm called kfp-db-primary and that the service exists (kubectl get svc kfp-db-primary -nkubeflow)?
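Checking that the service exists does not by itself show that its name resolves from inside a pod. A minimal sketch of a resolution check follows, assuming it is run from a pod in the kubeflow namespace that has Python available. The function name can_resolve is hypothetical, and the default port 3306 is only an assumption (the usual MySQL port), not something confirmed in this thread.

```python
import socket


def can_resolve(hostname, port=3306):
    """Return True if the resolver can turn hostname into at least one address.

    The port (3306, the usual MySQL port) is an assumption; it is only passed
    through to getaddrinfo and no connection is opened.
    """
    try:
        socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Covers both "name not known" and "temporary failure in name resolution".
        return False
    return True
```

For example, can_resolve("kfp-db-primary.kubeflow.svc.cluster.local") returning False from several nodes would point at cluster DNS rather than at the charm.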
@DnPlas, with the new 1.7 bundle and the k8s network issues fixed on our side, we no longer face this bug.
The issue was the availability of a DNS server between the nodes. Thank you for the help!
Closing the issue.
Bug Description
When deploying the bundle from the "bundle-kubeflow" repository, juju deploy fails on kfp-persistence while it tries to check the kfp-api health check.
To Reproduce
Environment
Relevant Log Output
Additional Context
No response