canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

katib-db-manager: hook failed: "update-status" #963

Open ACodingfreak opened 2 days ago

ACodingfreak commented 2 days ago

Bug Description

As shown in the juju status output below, the katib-db-manager unit is stuck in the error state with the message hook failed: "update-status".

mm323:~$ juju status
Model     Controller  Cloud/Region      Version  SLA          Timestamp
kubeflow  uk8sx       my-k8s/localhost  2.9.49   unsupported  16:36:42-07:00

App                        Version                  Status   Scale  Charm                    Channel              Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b    active       1  admission-webhook        1.7/stable           224  10.152.183.247  no
argo-controller            res:oci-image@3902c16    active       1  argo-controller          3.3/stable           376                  no
argo-server                res:oci-image@e2292c9    active       1  argo-server              3.3/stable           309                  no
dex-auth                                            active       1  dex-auth                 2.31/stable          389  10.152.183.43   no
istio-ingressgateway                                active       1  istio-gateway            1.16/stable         1005  10.152.183.29   no
istio-pilot                                         active       1  istio-pilot              1.16/stable          662  10.152.183.128  no
jupyter-controller         res:oci-image@1167186    active       1  jupyter-controller       1.7/stable           805                  no
jupyter-ui                                          active       1  jupyter-ui               1.7/stable           781  10.152.183.198  no
katib-controller           res:oci-image@111495a    active       1  katib-controller         0.15/stable          282  10.152.183.65   no
katib-db                   8.0.36-0ubuntu0.22.04.1  active       1  mysql-k8s                8.0/stable           153  10.152.183.12   no
katib-db-manager                                    waiting      1  katib-db-manager         0.15/stable          253  10.152.183.151  no       installing agent
katib-ui                                            active       1  katib-ui                 0.15/stable          267  10.152.183.13   no
kfp-api                                             active       1  kfp-api                  2.0-alpha.7/stable   935  10.152.183.50   no
kfp-db                     8.0.36-0ubuntu0.22.04.1  active       1  mysql-k8s                8.0/stable           153  10.152.183.84   no
kfp-persistence            res:oci-image@ebed770    active       1  kfp-persistence          2.0-alpha.7/stable   939                  no
kfp-profile-controller     res:oci-image@aa75b0c    active       1  kfp-profile-controller   2.0-alpha.7/stable   899  10.152.183.56   no
kfp-schedwf                res:oci-image@2cb9087    active       1  kfp-schedwf              2.0-alpha.7/stable   952                  no
kfp-ui                     res:oci-image@ae72602    active       1  kfp-ui                   2.0-alpha.7/stable   934  10.152.183.217  no
kfp-viewer                 res:oci-image@899e25f    active       1  kfp-viewer               2.0-alpha.7/stable   964                  no
kfp-viz                    res:oci-image@ffaf37e    active       1  kfp-viz                  2.0-alpha.7/stable   889  10.152.183.70   no
knative-eventing                                    active       1  knative-eventing         1.8/stable           345  10.152.183.174  no
knative-operator                                    active       1  knative-operator         1.8/stable           320  10.152.183.208  no
knative-serving                                     active       1  knative-serving          1.8/stable           346  10.152.183.73   no
kserve-controller                                   active       1  kserve-controller        0.10/stable          458  10.152.183.177  no
kubeflow-dashboard                                  active       1  kubeflow-dashboard       1.7/stable           439  10.152.183.206  no
kubeflow-profiles                                   active       1  kubeflow-profiles        1.7/stable           336  10.152.183.112  no
kubeflow-roles                                      active       1  kubeflow-roles           1.7/stable           148  10.152.183.3    no
kubeflow-volumes           res:oci-image@d261609    active       1  kubeflow-volumes         1.7/stable           204  10.152.183.41   no
metacontroller-operator                             active       1  metacontroller-operator  2.0/stable           204  10.152.183.120  no
minio                      res:oci-image@1755999    active       1  minio                    ckf-1.7/stable       214  10.152.183.121  no
oidc-gatekeeper            res:oci-image@7aae6d7    active       1  oidc-gatekeeper          ckf-1.7/stable       320  10.152.183.75   no
seldon-controller-manager                           active       1  seldon-core              1.15/stable          548  10.152.183.59   no
tensorboard-controller     res:oci-image@c52f7c2    active       1  tensorboard-controller   1.7/stable           156  10.152.183.71   no
tensorboards-web-app       res:oci-image@929f55b    active       1  tensorboards-web-app     1.7/stable           158  10.152.183.115  no
training-operator                                   active       1  training-operator        1.6/stable           305  10.152.183.76   no

Unit                          Workload  Agent  Address       Ports              Message
admission-webhook/0*          active    idle   10.1.121.226  4443/TCP
argo-controller/0*            active    idle   10.1.69.129
argo-server/0*                active    idle   10.1.121.229  2746/TCP
dex-auth/0*                   active    idle   10.1.121.204
istio-ingressgateway/0*       active    idle   10.1.121.205
istio-pilot/0*                active    idle   10.1.69.161
jupyter-controller/0*         active    idle   10.1.121.231
jupyter-ui/0*                 active    idle   10.1.69.164
katib-controller/0*           active    idle   10.1.121.230  443/TCP,8080/TCP
katib-db-manager/0*           error     idle   10.1.121.206                     hook failed: "update-status"
katib-db/0*                   active    idle   10.1.69.167                      Primary
katib-ui/0*                   active    idle   10.1.121.208
kfp-api/0*                    active    idle   10.1.121.209
kfp-db/0*                     active    idle   10.1.121.211                     Primary
kfp-persistence/0*            active    idle   10.1.69.133
kfp-profile-controller/0*     active    idle   10.1.69.130   80/TCP
kfp-schedwf/0*                active    idle   10.1.121.232
kfp-ui/0*                     active    idle   10.1.69.136   3000/TCP
kfp-viewer/0*                 active    idle   10.1.69.179
kfp-viz/0*                    active    idle   10.1.69.131   8888/TCP
knative-eventing/0*           active    idle   10.1.69.168
knative-operator/0*           active    idle   10.1.69.171
knative-serving/0*            active    idle   10.1.69.170
kserve-controller/0*          active    idle   10.1.69.173
kubeflow-dashboard/0*         active    idle   10.1.69.172
kubeflow-profiles/0*          active    idle   10.1.69.175
kubeflow-roles/0*             active    idle   10.1.69.174
kubeflow-volumes/0*           active    idle   10.1.121.217  5000/TCP
metacontroller-operator/0*    active    idle   10.1.121.212
minio/0*                      active    idle   10.1.121.221  9000/TCP,9001/TCP
oidc-gatekeeper/0*            active    idle   10.1.69.141   8080/TCP
seldon-controller-manager/0*  active    idle   10.1.69.177
tensorboard-controller/0*     active    idle   10.1.69.135   9443/TCP
tensorboards-web-app/0*       active    idle   10.1.69.182   5000/TCP
training-operator/0*          active    idle   10.1.121.215
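
With this many units, the one failing unit is easy to miss. A minimal sketch of a filter that prints only the units whose workload status is not "active", assuming the column layout shown above (Unit, Workload, Agent, ...). The function name non_active_units is hypothetical and not part of juju.

```shell
# Print "<unit> <workload-status>" for every unit that is not active.
# Reads `juju status` text on stdin; only the Unit table is inspected.
non_active_units() {
  awk '
    /^Unit[[:space:]]/ { in_units = 1; next }   # header row of the Unit table
    in_units && NF == 0 { in_units = 0 }        # a blank line ends the table
    in_units && $2 != "active" {
      sub(/\*$/, "", $1)                        # drop the leader marker "*"
      print $1, $2
    }
  '
}

# Usage (assumes a working juju client):
#   juju status | non_active_units
```

For the status output above, this should leave only the katib-db-manager/0 line.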

To Reproduce

sudo snap install microk8s --channel=1.24/stable --classic
sudo snap install juju --classic --channel=2.9/stable
microk8s config | juju add-k8s my-k8s --client
juju bootstrap my-k8s uk8sx
juju add-model kubeflow
juju deploy kubeflow --trust --channel=1.7/stable

Environment

Ubuntu:22.04
microk8s:1.24
juju:2.9
kubeflow:1.7

Relevant Log Output

The logs below are attached:

$ microk8s.kubectl logs -n kubeflow katib-db-manager-0 > katib-db-manager-0
Defaulted container "charm" out of: charm, katib-db-manager, charm-init (init)

$ microk8s.kubectl logs -n kubeflow katib-db-0 > katib-db-0
Defaulted container "charm" out of: charm, mysql, charm-init (init)

logs_2.zip
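
Note that the commands above defaulted to the "charm" container, so the workload containers (katib-db-manager, mysql) are not in those files. A minimal sketch for saving one log file per container, assuming the container names printed in the "Defaulted container" lines. The function name collect_pod_logs and the KUBECTL variable are hypothetical helpers, not part of kubectl.

```shell
# kubectl binary to use; override KUBECTL if plain kubectl is on the PATH.
KUBECTL="${KUBECTL:-microk8s.kubectl}"

# collect_pod_logs <namespace> <pod> <container>...
# Writes <pod>_<container>.log in the current directory for each container.
collect_pod_logs() {
  ns="$1"
  pod="$2"
  shift 2
  for c in "$@"; do
    "$KUBECTL" logs -n "$ns" "$pod" -c "$c" > "${pod}_${c}.log"
  done
}

# Usage (requires access to the cluster):
#   collect_pod_logs kubeflow katib-db-manager-0 charm katib-db-manager
#   collect_pod_logs kubeflow katib-db-0 charm mysql
```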

Additional Context

No response

syncronize-issues-to-jira[bot] commented 2 days ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5953.

This message was autogenerated

ACodingfreak commented 2 days ago

When I deleted the katib-db-manager pod, MicroK8s recreated it automatically and the unit then came up properly:

kubectl delete pod katib-db-manager-0 -n kubeflow

ACodingfreak commented 2 days ago

But now I am having an issue with the Kubeflow UI.

When I log in at http://10.10.26.236:31456/, I land on the page below:

image

I clicked "start setup":

image

When I click the finish button, nothing happens and I am not redirected to the next page. However, kubectl shows that the profile has been created:

$ kubectl get profiles
NAME    AGE
admin   20m

ACodingfreak commented 2 days ago

In the meantime, katib-db went down:

unit-katib-ui-0: 21:38:51 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-katib-db-0: 21:39:08 ERROR unit.katib-db/0.juju-log Failed to flush [<MySQLTextLogs.ERROR: 'ERROR LOGS'>, <MySQLTextLogs.GENERAL: 'GENERAL LOGS'>, <MySQLTextLogs.SLOW: 'SLOW LOGS'>] logs.
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-katib-db-0/charm/src/mysql_k8s_helpers.py", line 602, in _run_mysqlsh_script
    stdout, _ = process.wait_output()
  File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 1635, in wait_output
    raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 1 executing ['/usr/bin/mysqlsh', '--no-wizard', '--python', '--verbose=1', '-f', '/tmp/script.py', ';', 'rm', '/tmp/script.py'], stdout='', stderr='Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\nverbose: 2024-07-02T04:39:06Z: Loading startup files...\nverbose: 2024-07-02T04:39:06Z: Loading plugins...\nverbose: 2024-07-02T04:39:06Z: Connecting to MySQL at: serverconfig@katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local\nTraceback (most recent call last):\n  File "<string>", line 1, in <module>\nmysqlsh.DBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at \'reading initial communication packet\', system error: 104\n'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-katib-db-0/charm/lib/charms/mysql/v0/mysql.py", line 3139, in flush_mysql_logs
    self._run_mysqlsh_script("\n".join(flush_logs_commands), timeout=50)
  File "/var/lib/juju/agents/unit-katib-db-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 544, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "/var/lib/juju/agents/unit-katib-db-0/charm/src/mysql_k8s_helpers.py", line 605, in _run_mysqlsh_script
    raise MySQLClientError(e.stderr)
charms.mysql.v0.mysql.MySQLClientError: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
verbose: 2024-07-02T04:39:06Z: Loading startup files...
verbose: 2024-07-02T04:39:06Z: Loading plugins...
verbose: 2024-07-02T04:39:06Z: Connecting to MySQL at: serverconfig@katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local
Traceback (most recent call last):
  File "<string>", line 1, in <module>
mysqlsh.DBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at 'reading initial communication packet', system error: 104

unit-katib-db-0: 21:39:14 INFO unit.katib-db/0.juju-log Setting up the logrotate configurations
unit-katib-db-0: 21:39:14 INFO unit.katib-db/0.juju-log Adding pebble layer
unit-katib-db-0: 21:39:22 INFO unit.katib-db/0.juju-log Unit workload member-state is offline with member-role unknown
unit-katib-db-0: 21:39:22 INFO unit.katib-db/0.juju-log Attempting reboot from complete outage.
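
The ERROR entry is buried between INFO lines and multi-line tracebacks. A minimal sketch of a filter that keeps only the ERROR header lines, assuming the "unit-<name>: <time> <LEVEL> ..." layout seen above. The function name error_lines is hypothetical.

```shell
# Keep only log header lines whose level field is ERROR.
# Traceback continuation lines do not start with "unit-...:" and are dropped.
error_lines() {
  awk '$1 ~ /^unit-.*:$/ && $3 == "ERROR"'
}

# Usage (assumes a working juju client):
#   juju debug-log --replay --no-tail | error_lines
```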

$ juju status
Model     Controller  Cloud/Region      Version  SLA          Timestamp
kubeflow  uk8sx       my-k8s/localhost  2.9.49   unsupported  21:42:42-07:00

App                        Version                  Status   Scale  Charm                    Channel              Rev  Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b    active       1  admission-webhook        1.7/stable           224  10.152.183.247  no
argo-controller            res:oci-image@3902c16    active       1  argo-controller          3.3/stable           376                  no
argo-server                res:oci-image@e2292c9    active       1  argo-server              3.3/stable           309                  no
dex-auth                                            active       1  dex-auth                 2.31/stable          389  10.152.183.43   no
istio-ingressgateway                                active       1  istio-gateway            1.16/stable         1005  10.152.183.29   no
istio-pilot                                         active       1  istio-pilot              1.16/stable          662  10.152.183.128  no
jupyter-controller         res:oci-image@1167186    active       1  jupyter-controller       1.7/stable           805                  no
jupyter-ui                                          active       1  jupyter-ui               1.7/stable           781  10.152.183.198  no
katib-controller           res:oci-image@111495a    active       1  katib-controller         0.15/stable          282  10.152.183.65   no
katib-db                   8.0.36-0ubuntu0.22.04.1  waiting      1  mysql-k8s                8.0/stable           153  10.152.183.12   no       installing agent
katib-db-manager                                    active       1  katib-db-manager         0.15/stable          253  10.152.183.151  no
katib-ui                                            active       1  katib-ui                 0.15/stable          267  10.152.183.13   no
kfp-api                                             active       1  kfp-api                  2.0-alpha.7/stable   935  10.152.183.50   no
kfp-db                     8.0.36-0ubuntu0.22.04.1  active       1  mysql-k8s                8.0/stable           153  10.152.183.84   no
kfp-persistence            res:oci-image@ebed770    active       1  kfp-persistence          2.0-alpha.7/stable   939                  no
kfp-profile-controller     res:oci-image@aa75b0c    active       1  kfp-profile-controller   2.0-alpha.7/stable   899  10.152.183.56   no
kfp-schedwf                res:oci-image@2cb9087    active       1  kfp-schedwf              2.0-alpha.7/stable   952                  no
kfp-ui                     res:oci-image@ae72602    active       1  kfp-ui                   2.0-alpha.7/stable   934  10.152.183.217  no
kfp-viewer                 res:oci-image@899e25f    active       1  kfp-viewer               2.0-alpha.7/stable   964                  no
kfp-viz                    res:oci-image@ffaf37e    active       1  kfp-viz                  2.0-alpha.7/stable   889  10.152.183.70   no
knative-eventing                                    active       1  knative-eventing         1.8/stable           345  10.152.183.174  no
knative-operator                                    active       1  knative-operator         1.8/stable           320  10.152.183.208  no
knative-serving                                     active       1  knative-serving          1.8/stable           346  10.152.183.73   no
kserve-controller                                   active       1  kserve-controller        0.10/stable          458  10.152.183.177  no
kubeflow-dashboard                                  active       1  kubeflow-dashboard       1.7/stable           439  10.152.183.206  no
kubeflow-profiles                                   active       1  kubeflow-profiles        1.7/stable           336  10.152.183.112  no
kubeflow-roles                                      active       1  kubeflow-roles           1.7/stable           148  10.152.183.3    no
kubeflow-volumes           res:oci-image@d261609    active       1  kubeflow-volumes         1.7/stable           204  10.152.183.41   no
metacontroller-operator                             active       1  metacontroller-operator  2.0/stable           204  10.152.183.120  no
minio                      res:oci-image@1755999    active       1  minio                    ckf-1.7/stable       214  10.152.183.121  no
oidc-gatekeeper            res:oci-image@7aae6d7    active       1  oidc-gatekeeper          ckf-1.7/stable       320  10.152.183.75   no
seldon-controller-manager                           active       1  seldon-core              1.15/stable          548  10.152.183.59   no
tensorboard-controller     res:oci-image@c52f7c2    active       1  tensorboard-controller   1.7/stable           156  10.152.183.71   no
tensorboards-web-app       res:oci-image@929f55b    active       1  tensorboards-web-app     1.7/stable           158  10.152.183.115  no
training-operator                                   active       1  training-operator        1.6/stable           305  10.152.183.76   no

Unit                          Workload     Agent  Address       Ports              Message
admission-webhook/0*          active       idle   10.1.121.226  4443/TCP
argo-controller/0*            active       idle   10.1.69.129
argo-server/0*                active       idle   10.1.121.229  2746/TCP
dex-auth/0*                   active       idle   10.1.121.204
istio-ingressgateway/0*       active       idle   10.1.121.205
istio-pilot/0*                active       idle   10.1.69.161
jupyter-controller/0*         active       idle   10.1.121.231
jupyter-ui/0*                 active       idle   10.1.69.164
katib-controller/0*           active       idle   10.1.121.230  443/TCP,8080/TCP
katib-db-manager/0*           active       idle   10.1.69.142
katib-db/0*                   maintenance  idle   10.1.69.167                      offline
katib-ui/0*                   active       idle   10.1.121.208
kfp-api/0*                    active       idle   10.1.121.209
kfp-db/0*                     active       idle   10.1.121.211                     Primary

katib-db was then restarted automatically and came up properly.

DnPlas commented 1 day ago

@shayancanonical maybe you should also take a look at this one

ACodingfreak commented 1 day ago

After a server reboot, juju status showed multiple units in the "agent lost, see 'juju show-status-log'" state. I restarted the pods that were in the agent lost state, including dex. I then tried logging in to the Kubeflow UI again, and this time I got past the initial setup screens.