canonical / bundle-kubeflow

Charmed Kubeflow

Katib-ui stuck executing: (leader-elected) #482

Closed Sponge-Bas closed 2 years ago

Sponge-Bas commented 2 years ago

In this test run: https://solutions.qa.canonical.com/testruns/testRun/091e5168-80ac-4ed5-886c-476f47ce8b84, which is Kubeflow 1.6/beta on bare-metal Charmed Kubernetes 1.22, the deployment fails with the following status:

App                        Version                    Status   Scale  Charm                    Channel       Rev  Address         Exposed  Message
admission-webhook          res:oci-image@27b5dd6      active       1  admission-webhook        1.6/beta       41  10.152.183.48   no       
argo-controller            res:oci-image@669ebd5      active       1  argo-controller          3.3/beta       99                  no       
argo-server                res:oci-image@576d038      active       1  argo-server              3.3/beta       45                  no       
dex-auth                                              active       1  dex-auth                 2.31/beta     129  10.152.183.103  no       
istio-ingressgateway                                  waiting      1  istio-gateway            1.11/beta      97  10.152.183.214  no       installing agent
istio-pilot                                           waiting      1  istio-pilot              1.11/beta     118  10.152.183.65   no       installing agent
jupyter-controller         res:oci-image@c6baf31      active       1  jupyter-controller       1.6/beta      125                  no       
jupyter-ui                 res:oci-image@880266c      active       1  jupyter-ui               1.6/beta       86  10.152.183.146  no       
katib-controller           res:oci-image@7573c56      active       1  katib-controller         0.14/beta      71  10.152.183.176  no       
katib-db                   mariadb/server:10.3        active       1  charmed-osm-mariadb-k8s  stable         35  10.152.183.178  no       ready
katib-db-manager           res:oci-image@9ccf2e1      active       1  katib-db-manager         0.14/beta      45  10.152.183.142  no       
katib-ui                   res:oci-image@47108eb      waiting      1  katib-ui                 0.14/beta      69  10.152.183.113  no       
kfp-api                    res:oci-image@1b44753      waiting      1  kfp-api                  2.0/beta       81  10.152.183.253  no       
kfp-db                     mariadb/server:10.3        active       1  charmed-osm-mariadb-k8s  stable         35  10.152.183.23   no       ready
kfp-persistence            res:oci-image@31f08ad      active       1  kfp-persistence          2.0/beta       76                  no       
kfp-profile-controller     res:oci-image@d86ecff      waiting      1  kfp-profile-controller   2.0/beta       61  10.152.183.201  no       
kfp-schedwf                res:oci-image@51ffc60      active       1  kfp-schedwf              2.0/beta       80                  no       
kfp-ui                     res:oci-image@55148fd      waiting      1  kfp-ui                   2.0/beta       80  10.152.183.88   no       
kfp-viewer                 res:oci-image@7190aa3      active       1  kfp-viewer               2.0/beta       79                  no       
kfp-viz                    res:oci-image@67e8b09      waiting      1  kfp-viz                  2.0/beta       74  10.152.183.42   no       
kubeflow-dashboard         res:oci-image@1b9efb1      active       1  kubeflow-dashboard       1.6/beta      124  10.152.183.145  no       
kubeflow-profiles          res:profile-image@b329ecc  active       1  kubeflow-profiles        1.6/beta       78  10.152.183.189  no       
kubeflow-roles                                        active       1  kubeflow-roles           1.6/beta       31  10.152.183.186  no       
kubeflow-volumes           res:oci-image@889a67c      active       1  kubeflow-volumes         1.6/beta       55  10.152.183.149  no       
metacontroller-operator                               active       1  metacontroller-operator  2.0/beta       48  10.152.183.84   no       
minio                      res:oci-image@1755999      waiting      1  minio                    ckf-1.6/beta   95  10.152.183.3    no       
oidc-gatekeeper                                       waiting      1  oidc-gatekeeper          ckf-1.6/beta   76                  no       List of ingress-auth versions not found for apps: istio-pilot
seldon-controller-manager  res:oci-image@eb811b6      active       1  seldon-core              1.14/beta      87  10.152.183.114  no       
tensorboard-controller                                waiting      1  tensorboard-controller   1.6/beta       33                  no       Waiting for gateway relation
tensorboards-web-app       res:oci-image@57cbde3      active       1  tensorboards-web-app     1.6/beta       33  10.152.183.92   no       
training-operator                                     active       1  training-operator        1.5/beta       65  10.152.183.157  no       

Unit                          Workload  Agent      Address          Ports              Message
admission-webhook/0*          active    idle       192.168.253.133  4443/TCP           
argo-controller/0*            active    idle       192.168.254.204                     
argo-server/0*                active    idle       192.168.254.69   2746/TCP           
dex-auth/0*                   active    idle       192.168.253.134                     
istio-ingressgateway/0*       waiting   idle       192.168.254.70                      Waiting for istio-pilot relation data, deferring event
istio-pilot/0*                waiting   idle       192.168.254.199                     List of ingress versions not found for apps: katib-ui
jupyter-controller/0*         active    idle       192.168.254.200                     
jupyter-ui/0*                 active    idle       192.168.253.135  5000/TCP           
katib-controller/0*           active    idle       192.168.255.2    443/TCP,8080/TCP   
katib-db-manager/0*           active    idle       192.168.253.139  6789/TCP           
katib-db/0*                   active    idle       192.168.254.73   3306/TCP           ready
katib-ui/0*                   waiting   executing  192.168.254.75   8080/TCP           (leader-elected) waiting for container
kfp-api/0*                    error     idle       192.168.255.10   8888/TCP,8887/TCP  crash loop backoff: back-off 20s restarting failed container=ml-pipeline-api-server pod=kfp-api-df6ddcd9b-jg7sl_kubeflow(467cab7c-c5eb-4db5-afb3-2fe5e31726f3)
kfp-db/0*                     active    idle       192.168.253.197  3306/TCP           ready
kfp-persistence/0*            active    idle       192.168.254.79                      
kfp-profile-controller/0*     waiting   idle       192.168.254.80   80/TCP             waiting for container
kfp-schedwf/0*                active    idle       192.168.254.76                      
kfp-ui/0*                     waiting   idle       192.168.253.203  3000/TCP           waiting for container
kfp-viewer/0*                 active    idle       192.168.252.136                     
kfp-viz/0*                    waiting   idle       192.168.253.200  8888/TCP           waiting for container
kubeflow-dashboard/0*         active    idle       192.168.254.202  8082/TCP           
kubeflow-profiles/0*          active    idle       192.168.254.78   8080/TCP,8081/TCP  
kubeflow-roles/0*             active    idle       192.168.255.3                       
kubeflow-volumes/0*           active    idle       192.168.252.139  5000/TCP           
metacontroller-operator/0*    active    idle       192.168.255.4                       
minio/0*                      waiting   idle       192.168.253.202  9000/TCP,9001/TCP  waiting for container
oidc-gatekeeper/0*            waiting   idle                                           List of ingress-auth versions not found for apps: istio-pilot
seldon-controller-manager/0*  active    idle       192.168.253.144  8080/TCP,4443/TCP  
tensorboard-controller/0*     waiting   idle                                           Waiting for gateway relation
tensorboards-web-app/0*       active    idle       192.168.254.203  5000/TCP           
training-operator/0*          active    idle       192.168.255.6   

There are several problems, but I want to focus on katib-ui because the problems with this charm are consistent across re-deployments. Firstly, katib-ui takes a very long time to get a pod. This test run stopped after 5 minutes due to the kfp-api error, but in the previous deployment the katib-ui pod was stuck on Init:0/1 for about 50 minutes before it came up. I'm not sure what the root cause of this is.

Secondly, when katib-ui comes up it gets stuck on status 'executing' with message '(leader-elected)':

istio-ingressgateway/0*       waiting   idle       192.168.254.70                      Waiting for istio-pilot relation data, deferring event
istio-pilot/0*                waiting   idle       192.168.254.199                     List of ingress versions not found for apps: katib-ui
jupyter-controller/0*         active    idle       192.168.254.200                     
jupyter-ui/0*                 active    idle       192.168.253.135  5000/TCP           
katib-controller/0*           active    idle       192.168.255.2    443/TCP,8080/TCP   
katib-db-manager/0*           active    idle       192.168.253.139  6789/TCP           
katib-db/0*                   active    idle       192.168.254.73   3306/TCP           ready
katib-ui/0*                   active    executing  192.168.254.75   8080/TCP           (leader-elected)

This holds up the istio charms, which in turn hold up the oidc-gatekeeper charm. I think the problem here is that the charm is missing a refresh action for this specific message.
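To make the blocked chain easier to spot in an automated run, the status output can be filtered down to units that are not both active and idle. The following is a minimal sketch, not part of the original report. It assumes the JSON layout produced by `juju status --format json` (an `applications` map whose `units` entries carry `workload-status` and `juju-status`), and it assumes the model is named `kubeflow`. The function names `stuck_units` and `load_status` are hypothetical.

```python
import json
import subprocess


def stuck_units(status: dict) -> list:
    """Return (unit, workload, agent, message) for every unit that is
    not both workload=active and agent=idle."""
    rows = []
    for app in status.get("applications", {}).values():
        # "units" can be missing for applications without units.
        for unit_name, unit in (app.get("units") or {}).items():
            workload = unit.get("workload-status", {})
            agent = unit.get("juju-status", {})
            workload_state = workload.get("current", "unknown")
            agent_state = agent.get("current", "unknown")
            if workload_state != "active" or agent_state != "idle":
                rows.append(
                    (unit_name, workload_state, agent_state,
                     workload.get("message", ""))
                )
    return sorted(rows)


def load_status(model: str = "kubeflow") -> dict:
    """Fetch the model status as a dict. The model name is an assumption."""
    result = subprocess.run(
        ["juju", "status", "-m", model, "--format", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `stuck_units(load_status())` against the second status above would be expected to report katib-ui/0 (active, executing) alongside the waiting istio units, since an active workload with a non-idle agent is still counted as stuck.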

The logs for this testrun can be found here: https://oil-jenkins.canonical.com/artifacts/091e5168-80ac-4ed5-886c-476f47ce8b84/index.html

I think the files of interest are:

This is an automated test run and the environment was torn down when the error was encountered. If more information is needed, please let me know what exactly we need and I can collect this information in the next test run.

ca-scribner commented 2 years ago

Sorry @Basdbruijne, it took a while to get back to you on this. Is this still happening now that we've released 1.6/stable?

If yes, I'm curious about kubectl describe pod katib-ui-bf5875974-5pmv7 (or whatever the katib-ui's workload is named in the new env). Specifically, I'm wondering if it is stuck pulling the image or if something else happened. That has gotten me in the past, though never with katib-ui (I don't think the image is very large) and usually not repeatedly between deployments.
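Since the environment is torn down automatically, the same information could be collected by the test run itself. The sketch below is an illustration only, under these assumptions: the pod lives in the `kubeflow` namespace, and the JSON comes from `kubectl get pod <name> -n <namespace> -o json`. The helper names `waiting_reasons` and `fetch_pod` are hypothetical. It lists every init or main container that is in a waiting state, along with the reason Kubernetes reports.

```python
import json
import subprocess

# Waiting reasons that indicate a failed image pull.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def waiting_reasons(pod: dict) -> list:
    """Return (container, reason, message) for each init or main
    container that is currently in the 'waiting' state."""
    status = pod.get("status", {})
    found = []
    for key in ("initContainerStatuses", "containerStatuses"):
        for container in status.get(key, []):
            waiting = container.get("state", {}).get("waiting")
            if waiting:
                found.append(
                    (container.get("name", ""),
                     waiting.get("reason", ""),
                     waiting.get("message", ""))
                )
    return found


def fetch_pod(name: str, namespace: str = "kubeflow") -> dict:
    """Fetch a pod object as a dict. The namespace is an assumption."""
    result = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

For example, `waiting_reasons(fetch_pod("katib-ui-bf5875974-5pmv7"))` would show whether any container reports a reason in `IMAGE_PULL_REASONS`. A slow pull that is still in progress does not produce those reasons, so the Events section of `kubectl describe pod` remains the place to look for that case.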

Either way, am I right that you're saying because katib-ui is stuck it has prevented istio-pilot from supporting the rest of your deployment (eg: there's some other page not accessible, etc)? I think that's what I see here, specifically because istio-gateway looks non-functional, but trying to make sure

Sponge-Bas commented 2 years ago

> Either way, am I right that you're saying because katib-ui is stuck it has prevented istio-pilot from supporting the rest of your deployment (eg: there's some other page not accessible, etc)? I think that's what I see here, specifically because istio-gateway looks non-functional, but trying to make sure

Yes, that sounds right to me. I will schedule some deployments with 1.6/stable to see if the problem is fixed.

DomFleischmann commented 2 years ago

Hello @Basdbruijne, any updates on this?

Sponge-Bas commented 2 years ago

Hi @DomFleischmann, we have not seen this problem since switching to 1.6/stable, so I think we can close this bug.