canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

kubeflow services are not coming up using juju #447

Closed shrikantkeni closed 2 years ago

shrikantkeni commented 2 years ago

juju status shows multiple errors for dex-auth and seldon-core.

Even after applying the istio patch, the istio services still show a waiting state.

ubuntu@ip-172-31-9-84:~$ juju status --color
Model     Controller     Cloud/Region     Version  SLA          Timestamp
kubeflow  my-controller  myk8s/localhost  2.9.27   unsupported  13:57:34Z

App                        Version                    Status   Scale  Charm                    Channel      Rev  Address         Exposed  Message
admission-webhook          res:oci-image@fc124ea      waiting  1      admission-webhook        stable       12                   no       waiting for container
argo-controller            res:oci-image@0eec3c1      active   1      argo-controller          stable       55                   no
dex-auth                   res:oci-image@a74f783      error    1      dex-auth                 2.28/stable  78                   no       creating or updating custom resource definitions: ensuring custom resource definition "authcodes.dex.coreos.com" with version "v1beta1": cannot convert v1beta1 crd to v1: custom resource definition group "dex.coreos.com" not valid
envoy                      res:oci-image@b4adee5      active   1      envoy                    stable       6    10.152.183.146  no
istio-ingressgateway       res:oci-image@aae58cf      waiting  1      istio-gateway            1.5/stable   40   10.64.140.43    no
istio-pilot                res:oci-image@87fc646      waiting  1      istio-pilot              1.5/stable   61   10.152.183.193  no
jupyter-controller         res:oci-image@62a1ccf      waiting  1      jupyter-controller       stable       61                   no       waiting for container
jupyter-ui                 res:oci-image@5536a2d      active   1      jupyter-ui               stable       21   10.152.183.95   no
kfp-api                    res:oci-image@81e784a      active   1      kfp-api                  stable       33   10.152.183.212  no
kfp-db                     mariadb/server:10.3        active   1      charmed-osm-mariadb-k8s  stable       35   10.152.183.249  no
kfp-persistence            res:oci-image@1012943      active   1      kfp-persistence          stable       29                   no
kfp-profile-controller     res:oci-image@14ec522      active   1      kfp-profile-controller   stable       16   10.152.183.72   no
kfp-schedwf                res:oci-image@34e7e9e      active   1      kfp-schedwf              stable       32                   no
kfp-ui                     res:oci-image@b67a29c      active   1      kfp-ui                   stable       32   10.152.183.170  no
kfp-viewer                 res:oci-image@c208ebd      active   1      kfp-viewer               stable       31                   no
kfp-viz                    res:oci-image@13c46cf      active   1      kfp-viz                  stable       28   10.152.183.63   no
kubeflow-dashboard         res:oci-image@858a90f      waiting  1      kubeflow-dashboard       stable       64                   no       waiting for container
kubeflow-profiles          res:profile-image@f4450cf  error    1      kubeflow-profiles        stable       57                   no       creating or updating custom resource definitions: ensuring custom resource definition "serviceroles.rbac.istio.io" with version "v1beta1": cannot convert v1beta1 crd to v1: custom resource definition group "rbac.istio.io" not valid
kubeflow-roles                                        active   1      kubeflow-roles           stable       1    10.152.183.133  no
kubeflow-volumes           res:oci-image@fedee0e      active   1      kubeflow-volumes         stable       11   10.152.183.140  no
metacontroller-operator                               active   1      metacontroller-operator  stable       2    10.152.183.123  no
minio                      res:oci-image@1755999      active   1      minio                    stable       57   10.152.183.186  no
mlmd                       res:oci-image@e2cb9ce      active   1      mlmd                     stable       10   10.152.183.171  no
oidc-gatekeeper            res:oci-image@4e7f8dd      active   1      oidc-gatekeeper          stable       57   10.152.183.148  no
seldon-controller-manager  res:oci-image@047f2fc      waiting  1      seldon-core              stable       52   10.152.183.230  no
training-operator                                     active   1      training-operator        stable       6    10.152.183.124  no

Unit                         Workload  Agent  Address     Ports                                   Message
admission-webhook/0          waiting   idle                                                       waiting for container
argo-controller/0            active    idle   10.1.40.44
dex-auth/0                   waiting   idle                                                       waiting for container
envoy/0                      active    idle   10.1.40.45  9901/TCP,9090/TCP
istio-ingressgateway/0       waiting   idle   10.1.40.51  15020/TCP,80/TCP,443/TCP,15029/TCP,15030/TCP,15031/TCP,15032/TCP,15443/TCP,15011/TCP,8060/TCP,853/TCP  waiting for container
istio-pilot/0                waiting   idle   10.1.40.17  8080/TCP,15010/TCP,15012/TCP,15017/TCP  waiting for container
jupyter-controller/0         waiting   idle                                                       waiting for container
jupyter-ui/0                 active    idle   10.1.40.19  5000/TCP
kfp-api/0                    active    idle   10.1.40.48  8888/TCP,8887/TCP
kfp-db/0                     active    idle   10.1.40.22  3306/TCP                                ready
kfp-persistence/0            active    idle   10.1.40.42
kfp-profile-controller/0     active    idle   10.1.40.43  80/TCP
kfp-schedwf/0                active    idle   10.1.40.30
kfp-ui/0                     active    idle   10.1.40.49  3000/TCP
kfp-viewer/0                 active    idle   10.1.40.29
kfp-viz/0                    active    idle   10.1.40.38  8888/TCP
kubeflow-dashboard/0         waiting   idle                                                       waiting for container
kubeflow-profiles/0          waiting   idle                                                       waiting for container
kubeflow-roles/0             active    idle   10.1.40.20
kubeflow-volumes/0           active    idle   10.1.40.35  5000/TCP
metacontroller-operator/0    active    idle   10.1.40.23
minio/0                      active    idle   10.1.40.39  9000/TCP
mlmd/0                       active    idle   10.1.40.50  8080/TCP
oidc-gatekeeper/2            active    idle   10.1.40.56  8080/TCP
seldon-controller-manager/0  error     idle   10.1.40.46  8080/TCP,4443/TCP                       crash loop backoff: back-off 5m0s restarting failed container=seldon-core pod=seldon-controller-manager-5c447f5585-kvdrk_kubeflow(f874496a-0f46-4a5a-90c5-14348144715d)
training-operator/0          active    idle   10.1.40.25

ca-scribner commented 2 years ago

Did this eventually resolve itself? I see a lot of "waiting for container" messages in the status. That sounds like it is just taking time to deploy everything (probably time spent downloading the images, or it might be CPU bottlenecked while it actually deploys everything). Depending on the machine, it will take at least a few minutes. You can look at the workloads (kubectl get pods, kubectl describe pod ___) and see what they're waiting on. If it is something like pulling an image, then I think you're just stuck waiting on the internet.
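As a rough sketch of the pod check suggested above: the helper below filters `kubectl get pods` output down to pods that are not yet running. The function name `not_running` is hypothetical, and the usage note assumes the Juju model `kubeflow` maps to a Kubernetes namespace of the same name (the pod name in the status output suggests it does).

```shell
# Hypothetical helper: print the name and status of every pod that is not
# Running or Completed. It reads `kubectl get pods` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE), so it can be tried on saved
# output without access to a cluster.
not_running() {
  # NR > 1 skips the header row; $3 is the STATUS column.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

On a live cluster this could be used as `kubectl get pods -n kubeflow | not_running`, followed by `kubectl describe pod <pod-name> -n kubeflow` on each pod it lists to see whether the pod is still pulling an image or failing for another reason.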

DomFleischmann commented 2 years ago

Closing this issue because there has been no response. Feel free to reopen it if more information is needed.