canonical / bundle-kubeflow


katib-db-manager fails update-status due to health check on workload container #654

Closed: Sponge-Bas closed this issue 11 months ago

Sponge-Bas commented 1 year ago

In test run https://solutions.qa.canonical.com/testruns/4c1c6e4a-c895-4c2b-88d9-16d2b109d511/, which deploys ckf 1.7/stable on ck8s 1.24 (focal) on AWS, the installation fails with the following juju status:

App                        Version                Status   Scale  Charm                    Channel         Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b  active       1  admission-webhook        1.7/stable      205  10.152.183.176  no       
argo-controller            res:oci-image@669ebd5  active       1  argo-controller          3.3/stable      236                  no       
argo-server                res:oci-image@576d038  active       1  argo-server              3.3/stable      185                  no       
dex-auth                                          active       1  dex-auth                 2.31/stable     224  10.152.183.56   no       
istio-ingressgateway                              active       1  istio-gateway            1.16/stable     551  10.152.183.221  no       
istio-pilot                                       active       1  istio-pilot              1.16/stable     551  10.152.183.141  no       
jupyter-controller         res:oci-image@1167186  active       1  jupyter-controller       1.7/stable      607                  no       
jupyter-ui                                        active       1  jupyter-ui               1.7/stable      534  10.152.183.4    no       
katib-controller           res:oci-image@111495a  active       1  katib-controller         0.15/stable     282  10.152.183.77   no       
katib-db                                          waiting      1  mysql-k8s                8.0/stable       75  10.152.183.54   no       installing agent
katib-db-manager                                  waiting      1  katib-db-manager         0.15/stable     253  10.152.183.219  no       installing agent
katib-ui                                          active       1  katib-ui                 0.15/stable     267  10.152.183.180  no       
kfp-api                                           waiting      1  kfp-api                  2.0/stable      540  10.152.183.49   no       installing agent
kfp-db                                            waiting      1  mysql-k8s                8.0/stable       75  10.152.183.155  no       installing agent
kfp-persistence                                   waiting      1  kfp-persistence          2.0/stable      500                  no       Waiting for kfp-api relation data
kfp-profile-controller     res:oci-image@b26a126  active       1  kfp-profile-controller   2.0/stable      478  10.152.183.233  no       
kfp-schedwf                res:oci-image@68cce0a  active       1  kfp-schedwf              2.0/stable      515                  no       
kfp-ui                                            waiting      1  kfp-ui                   2.0/stable      504                  no       Waiting for kfp-api relation data
kfp-viewer                 res:oci-image@c0f065d  active       1  kfp-viewer               2.0/stable      517                  no       
kfp-viz                    res:oci-image@3de6f3c  active       1  kfp-viz                  2.0/stable      476  10.152.183.243  no       
knative-eventing                                  active       1  knative-eventing         1.8/stable      224  10.152.183.150  no       
knative-operator                                  active       1  knative-operator         1.8/stable      199  10.152.183.157  no       
knative-serving                                   active       1  knative-serving          1.8/stable      224  10.152.183.175  no       
kserve-controller                                 active       1  kserve-controller        0.10/stable     267  10.152.183.90   no       
kubeflow-dashboard                                active       1  kubeflow-dashboard       1.7/stable      307  10.152.183.85   no       
kubeflow-profiles                                 active       1  kubeflow-profiles        1.7/stable      269  10.152.183.200  no       
kubeflow-roles                                    active       1  kubeflow-roles           1.7/stable      113  10.152.183.93   no       
kubeflow-volumes           res:oci-image@d261609  active       1  kubeflow-volumes         1.7/stable      178  10.152.183.165  no       
metacontroller-operator                           active       1  metacontroller-operator  2.0/stable      117  10.152.183.116  no       
minio                      res:oci-image@1755999  active       1  minio                    ckf-1.7/stable  186  10.152.183.241  no       
oidc-gatekeeper            res:oci-image@6b720b8  active       1  oidc-gatekeeper          ckf-1.7/stable  176  10.152.183.159  no       
seldon-controller-manager                         active       1  seldon-core              1.15/stable     457  10.152.183.227  no       
tensorboard-controller     res:oci-image@c52f7c2  active       1  tensorboard-controller   1.7/stable      156  10.152.183.178  no       
tensorboards-web-app       res:oci-image@929f55b  active       1  tensorboards-web-app     1.7/stable      158  10.152.183.101  no       
training-operator                                 active       1  training-operator        1.6/stable      215  10.152.183.181  no       

Unit                          Workload  Agent  Address          Ports              Message
admission-webhook/0*          active    idle   192.168.68.201   4443/TCP           
argo-controller/0*            active    idle   192.168.68.233                      
argo-server/0*                active    idle   192.168.68.204   2746/TCP           
dex-auth/0*                   active    idle   192.168.68.200                      
istio-ingressgateway/0*       active    idle   192.168.68.202                      
istio-pilot/0*                active    idle   192.168.68.203                      
jupyter-controller/0*         active    idle   192.168.184.70                      
jupyter-ui/0*                 active    idle   192.168.184.69                      
katib-controller/0*           active    idle   192.168.68.210   443/TCP,8080/TCP   
katib-db-manager/0*           error     idle   192.168.68.206                      hook failed: "update-status"
katib-db/0*                   blocked   idle   192.168.184.72                      Unable to configure instance
katib-ui/0*                   active    idle   192.168.229.203                     
kfp-api/0*                    waiting   idle   192.168.68.207                      Waiting for relational-db data
kfp-db/0*                     blocked   idle   192.168.229.204                     Unable to configure instance
kfp-persistence/0*            waiting   idle                                       Waiting for kfp-api relation data
kfp-profile-controller/0*     active    idle   192.168.229.216  80/TCP             
kfp-schedwf/0*                active    idle   192.168.229.213                     
kfp-ui/0*                     waiting   idle                                       Waiting for kfp-api relation data
kfp-viewer/0*                 active    idle   192.168.68.226                      
kfp-viz/0*                    active    idle   192.168.229.214  8888/TCP           
knative-eventing/0*           active    idle   192.168.68.208                      
knative-operator/0*           active    idle   192.168.68.214                      
knative-serving/0*            active    idle   192.168.68.209                      
kserve-controller/0*          active    idle   192.168.184.73                      
kubeflow-dashboard/0*         active    idle   192.168.68.213                      
kubeflow-profiles/0*          active    idle   192.168.68.216                      
kubeflow-roles/0*             active    idle   192.168.68.211                      
kubeflow-volumes/0*           active    idle   192.168.68.232   5000/TCP           
metacontroller-operator/0*    active    idle   192.168.68.212                      
minio/0*                      active    idle   192.168.184.83   9000/TCP,9001/TCP  
oidc-gatekeeper/0*            active    idle   192.168.68.234   8080/TCP           
seldon-controller-manager/0*  active    idle   192.168.184.74                      
tensorboard-controller/0*     active    idle   192.168.184.84   9443/TCP           
tensorboards-web-app/0*       active    idle   192.168.184.82   5000/TCP           
training-operator/0*          active    idle   192.168.229.205        
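
The failing hook can also be inspected directly from the model with the juju CLI (assuming access to the controller), for example:

juju show-status-log katib-db-manager/0
juju debug-log --replay --include katib-db-manager/0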

Looking at the pod logs (which can be downloaded here), it looks like a health check failed to run:

2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status Error in sys.excepthook:
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status Traceback (most recent call last):
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     self.emit(record)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/log.py", line 41, in emit
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     self.model_backend.juju_log(record.levelname, self.format(record))
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     return fmt.format(record)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     record.exc_text = self.formatException(record.exc_info)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     traceback.print_exception(ei[0], ei[1], tb, None, sio)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     for line in TracebackException(
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/traceback.py", line 617, in format
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     yield from self.format_exception_only()
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     stype = smod + '.' + stype
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status Original exception was:
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status Traceback (most recent call last):
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "./src/charm.py", line 366, in _refresh_status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     check = self._get_check_status()
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "./src/charm.py", line 360, in _get_check_status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     return self.container.get_check("katib-db-manager-up").status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 1980, in get_check
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     raise ModelError(f'check {check_name!r} not found')
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status ops.model.ModelError: check 'katib-db-manager-up' not found
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status The above exception was the direct cause of the following exception:
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status Traceback (most recent call last):
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "./src/charm.py", line 430, in <module>
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     main(KatibDBManagerOperator)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/main.py", line 441, in main
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     _emit_charm_event(charm, dispatcher.event_name)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     event_to_emit.emit(*args, **kwargs)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/framework.py", line 354, in emit
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     framework._emit(event)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/framework.py", line 830, in _emit
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     self._reemit(event_path)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/framework.py", line 919, in _reemit
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     custom_handler(event)
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "./src/charm.py", line 381, in _on_update_status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     self._refresh_status()
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status   File "./src/charm.py", line 368, in _refresh_status
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status     raise GenericCharmRuntimeError(
2023-08-03T19:14:26.312Z [container-agent] 2023-08-03 19:14:26 WARNING update-status <unknown>GenericCharmRuntimeError: Failed to run health check on workload container

More logs and configs can be found here: https://oil-jenkins.canonical.com/artifacts/4c1c6e4a-c895-4c2b-88d9-16d2b109d511/index.html
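
For illustration: the failure happens because ops' Container.get_check() raises ModelError when no Pebble layer defines a check named "katib-db-manager-up", and the charm's _refresh_status() re-raises that as GenericCharmRuntimeError, which errors the update-status hook. Below is a minimal sketch (not the charm's actual code) of an update-status path that tolerates the missing check instead; the workload container name and the status messages are assumptions:

from ops.model import ActiveStatus, MaintenanceStatus, ModelError, WaitingStatus
from ops.pebble import CheckStatus

CHECK_NAME = "katib-db-manager-up"  # check name as it appears in the traceback


def _refresh_status(self):
    # The workload container name is assumed for this sketch.
    container = self.unit.get_container("katib-db-manager")
    if not container.can_connect():
        self.unit.status = WaitingStatus("Waiting for Pebble in the workload container")
        return
    try:
        check = container.get_check(CHECK_NAME)
    except ModelError:
        # get_check() raises ModelError when no Pebble layer defines the check yet,
        # e.g. if the layer was never added because the database relation is not ready.
        self.unit.status = MaintenanceStatus("Workload health check not configured yet")
        return
    if check.status == CheckStatus.UP:
        self.unit.status = ActiveStatus()
    else:
        self.unit.status = MaintenanceStatus("Workload health check is failing")

In this run the check was presumably never added because katib-db itself is blocked, so every update-status run hit this path.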

orfeas-k commented 1 year ago

This looks like exactly the same issue as https://github.com/canonical/bundle-kubeflow/issues/631. Since I just responded there, I'll copy my response over here too:

There is a known issue with mysql-k8s-operator that has been fixed but not yet published to the 8.0/stable channel (you can view the published revisions here). Could you please confirm that deploying 1.7/edge (which uses the mysql-k8s edge channel) actually solves this issue for you?
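
If it helps, on an existing model the database applications from the status above could be switched to the edge channel with something like the following (the commands are illustrative, not a tested procedure):

juju refresh katib-db --channel 8.0/edge
juju refresh kfp-db --channel 8.0/edge

Alternatively, the whole bundle can be redeployed from 1.7/edge.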

orfeas-k commented 1 year ago

Note that we're pushing for this to be released to 8.0/stable, so using edge won't be needed.

NohaIhab commented 11 months ago

The fix in mysql-k8s-operator was released to 8.0/stable, so this can be closed now.
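
For anyone who hit this on an existing 1.7/stable deployment, refreshing the MySQL applications should pick up the fixed revision from the channel, for example (application names taken from the status above):

juju refresh katib-db --channel 8.0/stable
juju refresh kfp-db --channel 8.0/stable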