canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Argo-controller in error state without pod-log #490

Closed. Sponge-Bas closed this issue 11 months ago

Sponge-Bas commented 2 years ago

In testrun https://solutions.qa.canonical.com/testruns/testRun/4766dc32-4035-4840-8d5d-1bdc13853881, the argo-controller unit is in error state:

App                        Version                    Status   Scale  Charm                    Channel       Rev  Address         Exposed  Message
admission-webhook          res:oci-image@ce13018      active       1  admission-webhook        1.6/edge       44  10.152.183.172  no       
argo-controller            res:oci-image@669ebd5      waiting    2/1  argo-controller          3.3/edge       99                  no       
dex-auth                                              active       1  dex-auth                 2.31/edge     129  10.152.183.134  no       
istio-ingressgateway                                  active       1  istio-gateway            1.11/edge     107  10.152.183.227  no       
istio-pilot                                           waiting      1  istio-pilot              1.11/edge     127  10.152.183.146  no       installing agent
jupyter-controller         res:oci-image@c6baf31      active       1  jupyter-controller       1.6/edge      125                  no       
jupyter-ui                 res:oci-image@880266c      active       1  jupyter-ui               1.6/edge       86  10.152.183.168  no       
kfp-api                    res:oci-image@1b44753      active       1  kfp-api                  2.0/edge       81  10.152.183.142  no       
kfp-db                     mariadb/server:10.3        active       1  charmed-osm-mariadb-k8s  stable         35  10.152.183.37   no       ready
kfp-persistence            res:oci-image@31f08ad      active       1  kfp-persistence          2.0/edge       76                  no       
kfp-profile-controller     res:oci-image@d86ecff      active       1  kfp-profile-controller   2.0/edge       61  10.152.183.177  no       
kfp-schedwf                res:oci-image@51ffc60      active       1  kfp-schedwf              2.0/edge       80                  no       
kfp-ui                     res:oci-image@55148fd      active       1  kfp-ui                   2.0/edge       80  10.152.183.124  no       
kfp-viewer                 res:oci-image@7190aa3      active       1  kfp-viewer               2.0/edge       79                  no       
kfp-viz                    res:oci-image@67e8b09      active       1  kfp-viz                  2.0/edge       74  10.152.183.76   no       
kubeflow-dashboard         res:oci-image@1b9efb1      active       1  kubeflow-dashboard       1.6/edge      124  10.152.183.38   no       
kubeflow-profiles          res:profile-image@b1d7af7  active       1  kubeflow-profiles        1.6/edge       80  10.152.183.243  no       
kubeflow-roles                                        active       1  kubeflow-roles           1.6/edge       31  10.152.183.88   no       
kubeflow-volumes           res:oci-image@889a67c      active       1  kubeflow-volumes         1.6/edge       55  10.152.183.112  no       
metacontroller-operator                               active       1  metacontroller-operator  2.0/edge       48  10.152.183.15   no       
minio                      res:oci-image@1755999      active       1  minio                    ckf-1.6/edge   95  10.152.183.196  no       
oidc-gatekeeper            res:oci-image@32de216      active       1  oidc-gatekeeper          ckf-1.6/edge   76  10.152.183.118  no       
seldon-controller-manager  res:oci-image@eb811b6      active       1  seldon-core              1.14/edge      87  10.152.183.89   no       
training-operator                                     active       1  training-operator        1.5/edge       65  10.152.183.193  no       

Unit                          Workload     Agent      Address          Ports              Message
admission-webhook/0*          active       idle       192.168.147.202  4443/TCP           
argo-controller/0*            error        idle       192.168.135.83                      container error: 
argo-controller/1             waiting      executing  192.168.158.94                      (config-changed) Waiting for leadership
dex-auth/0*                   active       idle       192.168.158.74                      
istio-ingressgateway/0*       active       idle       192.168.147.201                     
istio-pilot/0*                active       executing  192.168.158.96                      (upgrade-charm) 
jupyter-controller/0*         active       idle       192.168.158.76                      
jupyter-ui/0*                 active       idle       192.168.135.72   5000/TCP           
kfp-api/0*                    active       idle       192.168.158.90   8888/TCP,8887/TCP  
kfp-db/0*                     active       idle       192.168.147.208  3306/TCP           ready
kfp-persistence/0*            active       idle       192.168.147.213                     
kfp-profile-controller/0*     active       idle       192.168.158.89   80/TCP             
kfp-schedwf/0*                active       idle       192.168.158.84                      
kfp-ui/0*                     active       idle       192.168.158.91   3000/TCP           
kfp-viewer/0*                 active       idle       192.168.158.83                      
kfp-viz/0*                    active       idle       192.168.135.77   8888/TCP           
kubeflow-dashboard/0*         active       idle       192.168.158.88   8082/TCP           
kubeflow-profiles/0*          active       idle       192.168.147.212  8080/TCP,8081/TCP  
kubeflow-roles/0*             active       idle       192.168.158.78                      
kubeflow-volumes/0*           maintenance  executing  192.168.135.80   5000/TCP           (upgrade-charm) Setting pod spec
metacontroller-operator/0*    active       idle       192.168.158.79                      
minio/0*                      active       executing  192.168.158.87   9000/TCP,9001/TCP  
oidc-gatekeeper/0*            active       idle       192.168.135.82   8080/TCP           
seldon-controller-manager/0*  active       idle       192.168.147.211  8080/TCP,4443/TCP  
training-operator/0*          active       idle       192.168.158.81                      
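
With this many units, the ones that are not healthy are easy to miss in the table above. A minimal sketch for pulling them out of the pasted status text follows. The function name `non_active_units` is hypothetical, and the parsing assumes the whitespace-separated layout shown above (unit table introduced by a `Unit` header row and ended by a blank line); it is not an official Juju tool.

```python
def non_active_units(status_text):
    """Return (unit, workload, agent) for every unit whose workload is not 'active'.

    Assumes the plain-text layout of the status output pasted above.
    """
    flagged = []
    in_units = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            in_units = False  # a blank line ends the current table
            continue
        if fields[0] == "Unit":
            in_units = True  # header row of the unit table
            continue
        if in_units and len(fields) >= 3:
            # A trailing "*" marks the leader unit; strip it from the name.
            unit, workload, agent = fields[0].rstrip("*"), fields[1], fields[2]
            if workload != "active":
                flagged.append((unit, workload, agent))
    return flagged


# A few rows copied from the unit table above.
sample = """\
Unit                          Workload     Agent      Address          Ports              Message
admission-webhook/0*          active       idle       192.168.147.202  4443/TCP
argo-controller/0*            error        idle       192.168.135.83                      container error:
argo-controller/1             waiting      executing  192.168.158.94                      (config-changed) Waiting for leadership
kubeflow-volumes/0*           maintenance  executing  192.168.135.80   5000/TCP           (upgrade-charm) Setting pod spec
"""

for unit, workload, agent in non_active_units(sample):
    print(f"{unit}: workload={workload}, agent={agent}")
```

On the full output above this would flag argo-controller/0, argo-controller/1 and kubeflow-volumes/0; units such as istio-pilot/0 and minio/0 are still executing hooks but report an active workload, so they are not flagged.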

The Kubeflow crashdump shows that the pod is indeed in an error state, but the Kubernetes crashdump contains no logs for this container under 9/baremetal/pod-logs/.

I'm not too sure what happened here, but maybe one of you can make it out from the available logs.
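
To double-check that the logs are really absent rather than stored under an unexpected name, a sketch like the one below could be run against an extracted crashdump. It assumes the crashdump has been unpacked to a local directory and that log file names contain the pod name; both are assumptions, as are the function name `find_pod_logs` and the file names in the demo.

```python
import tempfile
from pathlib import Path


def find_pod_logs(pod_logs_dir, name_fragment):
    """List files under pod_logs_dir whose file name contains name_fragment.

    Returns paths relative to pod_logs_dir, sorted; an empty list means no match.
    """
    root = Path(pod_logs_dir)
    if not root.is_dir():
        return []
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if path.is_file() and name_fragment in path.name
    )


# Demo on a throwaway directory; the file name below is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    logs = Path(tmp) / "pod-logs"
    logs.mkdir()
    (logs / "kubeflow-dex-auth-0.log").write_text("example log line\n")
    demo_hits = find_pod_logs(logs, "dex-auth")
    demo_missing = find_pod_logs(logs, "argo-controller")

print(demo_hits)
print(demo_missing)
```

An empty result for `argo-controller` on the real 9/baremetal/pod-logs/ directory would confirm that nothing was collected for that container.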

ca-scribner commented 2 years ago

I am also confused. Why argo-controller/0 went into error is one question (I don't see anything in the debug-logs that tells me), but the fact that argo-controller/1 was prevented from taking over sounds even more problematic. I wonder if that's because argo-controller/0 was stuck in an error state?

We'll need to dig in further.

moisesbenzan commented 2 years ago

Also seen on this run: https://solutions.qa.canonical.com/v2/testruns/b53a3ef6-fe06-434b-8a50-d2acab67eb48/

This time, however, it affected the kfp-viz units. Here is the Kubernetes crashdump and here is the Kubeflow crashdump.

moisesbenzan commented 2 years ago

Seen again on this run: https://solutions.qa.canonical.com/v2/testruns/dfd21ddf-545e-4427-9e44-0a224a91a36a/

This time it affected the kfp-profile-controller units. Here's the Kubernetes crashdump and here is the Kubeflow crashdump.

ca-scribner commented 2 years ago

Thank you for these logs! We will check them out.

beliaev-maksim commented 1 year ago

This should be resolved by the sidecar rewrite. Not taking action until the rewrite.

ca-scribner commented 11 months ago

We think these should be resolved in the most recent charms. Please reopen or post again if this persists.