Open ACodingfreak opened 2 days ago
Thank you for reporting us your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5953.
This message was autogenerated
When I deleted the katib-db-manager pod, MicroK8s automatically recreated it and it then came up properly:
kubectl delete pod katib-db-manager-0 -n kubeflow
But now I am having an issue with the Kubeflow UI.
When I log in at http://10.10.26.236:31456/ I land on the page below.
I clicked "start setup".
When I then click the Finish button, nothing happens and I am not redirected to the next page, although kubectl reports that the profile has been created:
$ kubectl get profiles
NAME AGE
admin 20m
In the meantime, katib-db went down:
unit-katib-ui-0: 21:38:51 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-katib-db-0: 21:39:08 ERROR unit.katib-db/0.juju-log Failed to flush [<MySQLTextLogs.ERROR: 'ERROR LOGS'>, <MySQLTextLogs.GENERAL: 'GENERAL LOGS'>, <MySQLTextLogs.SLOW: 'SLOW LOGS'>] logs.
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-katib-db-0/charm/src/mysql_k8s_helpers.py", line 602, in _run_mysqlsh_script
stdout, _ = process.wait_output()
File "/var/lib/juju/agents/unit-katib-db-0/charm/venv/ops/pebble.py", line 1635, in wait_output
raise ExecError[AnyStr](self._command, exit_code, out_value, err_value)
ops.pebble.ExecError: non-zero exit code 1 executing ['/usr/bin/mysqlsh', '--no-wizard', '--python', '--verbose=1', '-f', '/tmp/script.py', ';', 'rm', '/tmp/script.py'], stdout='', stderr='Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory\nverbose: 2024-07-02T04:39:06Z: Loading startup files...\nverbose: 2024-07-02T04:39:06Z: Loading plugins...\nverbose: 2024-07-02T04:39:06Z: Connecting to MySQL at: serverconfig@katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local\nTraceback (most recent call last):\n File "<string>", line 1, in <module>\nmysqlsh.DBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at \'reading initial communication packet\', system error: 104\n'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-katib-db-0/charm/lib/charms/mysql/v0/mysql.py", line 3139, in flush_mysql_logs
self._run_mysqlsh_script("\n".join(flush_logs_commands), timeout=50)
File "/var/lib/juju/agents/unit-katib-db-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 544, in wrapped_function
return callable(*args, **kwargs) # type: ignore
File "/var/lib/juju/agents/unit-katib-db-0/charm/src/mysql_k8s_helpers.py", line 605, in _run_mysqlsh_script
raise MySQLClientError(e.stderr)
charms.mysql.v0.mysql.MySQLClientError: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
verbose: 2024-07-02T04:39:06Z: Loading startup files...
verbose: 2024-07-02T04:39:06Z: Loading plugins...
verbose: 2024-07-02T04:39:06Z: Connecting to MySQL at: serverconfig@katib-db-0.katib-db-endpoints.kubeflow.svc.cluster.local
Traceback (most recent call last):
File "<string>", line 1, in <module>
mysqlsh.DBError: MySQL Error (2013): Shell.connect: Lost connection to MySQL server at 'reading initial communication packet', system error: 104
unit-katib-db-0: 21:39:14 INFO unit.katib-db/0.juju-log Setting up the logrotate configurations
unit-katib-db-0: 21:39:14 INFO unit.katib-db/0.juju-log Adding pebble layer
unit-katib-db-0: 21:39:22 INFO unit.katib-db/0.juju-log Unit workload member-state is offline with member-role unknown
unit-katib-db-0: 21:39:22 INFO unit.katib-db/0.juju-log Attempting reboot from complete outage.
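For anyone triaging a similar excerpt, here is a minimal Python sketch that pulls only the error entries out of log lines shaped like the ones above. The regular expression, the `LogEntry` type, and the function names are illustrative assumptions, not part of Juju or the charm; the pattern is inferred only from the lines in this report.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple

# Matches lines shaped like the excerpt above, for example:
#   unit-katib-db-0: 21:39:08 ERROR unit.katib-db/0.juju-log <message>
LOG_LINE = re.compile(
    r"^(?P<entity>[\w-]+): (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<module>\S+) (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    entity: str
    time: str
    level: str
    module: str
    message: str


def parse_entries(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield one LogEntry per recognised line.

    Continuation lines (tracebacks, mysqlsh "verbose:" output) do not
    match the pattern and are skipped.
    """
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            yield LogEntry(**match.groupdict())


def errors_only(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only entries logged at ERROR or CRITICAL level."""
    return [e for e in parse_entries(lines) if e.level in {"ERROR", "CRITICAL"}]
```

Passing the saved log text through `errors_only(text.splitlines())` would leave just the `Failed to flush ... logs.` entry from the excerpt above, which makes it easier to see when the database first became unreachable.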
$ juju status
Model     Controller  Cloud/Region      Version  SLA          Timestamp
kubeflow  uk8sx       my-k8s/localhost  2.9.49   unsupported  21:42:42-07:00

App                        Version                  Status   Scale  Charm                    Channel             Rev   Address         Exposed  Message
admission-webhook          res:oci-image@2d74d1b    active   1      admission-webhook        1.7/stable          224   10.152.183.247  no
argo-controller            res:oci-image@3902c16    active   1      argo-controller          3.3/stable          376                   no
argo-server                res:oci-image@e2292c9    active   1      argo-server              3.3/stable          309                   no
dex-auth                                            active   1      dex-auth                 2.31/stable         389   10.152.183.43   no
istio-ingressgateway                                active   1      istio-gateway            1.16/stable         1005  10.152.183.29   no
istio-pilot                                         active   1      istio-pilot              1.16/stable         662   10.152.183.128  no
jupyter-controller         res:oci-image@1167186    active   1      jupyter-controller       1.7/stable          805                   no
jupyter-ui                                          active   1      jupyter-ui               1.7/stable          781   10.152.183.198  no
katib-controller           res:oci-image@111495a    active   1      katib-controller         0.15/stable         282   10.152.183.65   no
katib-db                   8.0.36-0ubuntu0.22.04.1  waiting  1      mysql-k8s                8.0/stable          153   10.152.183.12   no       installing agent
katib-db-manager                                    active   1      katib-db-manager         0.15/stable         253   10.152.183.151  no
katib-ui                                            active   1      katib-ui                 0.15/stable         267   10.152.183.13   no
kfp-api                                             active   1      kfp-api                  2.0-alpha.7/stable  935   10.152.183.50   no
kfp-db                     8.0.36-0ubuntu0.22.04.1  active   1      mysql-k8s                8.0/stable          153   10.152.183.84   no
kfp-persistence            res:oci-image@ebed770    active   1      kfp-persistence          2.0-alpha.7/stable  939                   no
kfp-profile-controller     res:oci-image@aa75b0c    active   1      kfp-profile-controller   2.0-alpha.7/stable  899   10.152.183.56   no
kfp-schedwf                res:oci-image@2cb9087    active   1      kfp-schedwf              2.0-alpha.7/stable  952                   no
kfp-ui                     res:oci-image@ae72602    active   1      kfp-ui                   2.0-alpha.7/stable  934   10.152.183.217  no
kfp-viewer                 res:oci-image@899e25f    active   1      kfp-viewer               2.0-alpha.7/stable  964                   no
kfp-viz                    res:oci-image@ffaf37e    active   1      kfp-viz                  2.0-alpha.7/stable  889   10.152.183.70   no
knative-eventing                                    active   1      knative-eventing         1.8/stable          345   10.152.183.174  no
knative-operator                                    active   1      knative-operator         1.8/stable          320   10.152.183.208  no
knative-serving                                     active   1      knative-serving          1.8/stable          346   10.152.183.73   no
kserve-controller                                   active   1      kserve-controller        0.10/stable         458   10.152.183.177  no
kubeflow-dashboard                                  active   1      kubeflow-dashboard       1.7/stable          439   10.152.183.206  no
kubeflow-profiles                                   active   1      kubeflow-profiles        1.7/stable          336   10.152.183.112  no
kubeflow-roles                                      active   1      kubeflow-roles           1.7/stable          148   10.152.183.3    no
kubeflow-volumes           res:oci-image@d261609    active   1      kubeflow-volumes         1.7/stable          204   10.152.183.41   no
metacontroller-operator                             active   1      metacontroller-operator  2.0/stable          204   10.152.183.120  no
minio                      res:oci-image@1755999    active   1      minio                    ckf-1.7/stable      214   10.152.183.121  no
oidc-gatekeeper            res:oci-image@7aae6d7    active   1      oidc-gatekeeper          ckf-1.7/stable      320   10.152.183.75   no
seldon-controller-manager                           active   1      seldon-core              1.15/stable         548   10.152.183.59   no
tensorboard-controller     res:oci-image@c52f7c2    active   1      tensorboard-controller   1.7/stable          156   10.152.183.71   no
tensorboards-web-app       res:oci-image@929f55b    active   1      tensorboards-web-app     1.7/stable          158   10.152.183.115  no
training-operator                                   active   1      training-operator        1.6/stable          305   10.152.183.76   no

Unit                     Workload     Agent  Address       Ports             Message
admission-webhook/0*     active       idle   10.1.121.226  4443/TCP
argo-controller/0*       active       idle   10.1.69.129
argo-server/0*           active       idle   10.1.121.229  2746/TCP
dex-auth/0*              active       idle   10.1.121.204
istio-ingressgateway/0*  active       idle   10.1.121.205
istio-pilot/0*           active       idle   10.1.69.161
jupyter-controller/0*    active       idle   10.1.121.231
jupyter-ui/0*            active       idle   10.1.69.164
katib-controller/0*      active       idle   10.1.121.230  443/TCP,8080/TCP
katib-db-manager/0*      active       idle   10.1.69.142
katib-db/0*              maintenance  idle   10.1.69.167                     offline
katib-ui/0*              active       idle   10.1.121.208
kfp-api/0*               active       idle   10.1.121.209
kfp-db/0*                active       idle   10.1.121.211                    Primary
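To spot entries like katib-db without scanning the whole table, one option is to filter the JSON form of the status instead. The sketch below assumes the output of `juju status --format=json` exposes `applications`, `application-status`, `units`, and `workload-status` keys; verify that against your Juju version. `not_active` is a hypothetical helper name, and the sample values are copied from the status above.

```python
from typing import Dict, List, Tuple


def not_active(status: Dict) -> List[Tuple[str, str, str]]:
    """Return (name, current status, message) for every application or
    unit whose status is anything other than "active"."""
    found = []
    for app_name, app in status.get("applications", {}).items():
        app_status = app.get("application-status", {})
        if app_status.get("current") != "active":
            found.append(
                (app_name, app_status.get("current", "unknown"), app_status.get("message", ""))
            )
        # "units" can be absent for some applications, hence the `or {}`.
        for unit_name, unit in (app.get("units") or {}).items():
            workload = unit.get("workload-status", {})
            if workload.get("current") != "active":
                found.append(
                    (unit_name, workload.get("current", "unknown"), workload.get("message", ""))
                )
    return found


# Hand-built sample mirroring two applications from the status above.
sample_status = {
    "applications": {
        "katib-db": {
            "application-status": {"current": "waiting", "message": "installing agent"},
            "units": {
                "katib-db/0": {"workload-status": {"current": "maintenance", "message": "offline"}}
            },
        },
        "kfp-db": {
            "application-status": {"current": "active"},
            "units": {
                "kfp-db/0": {"workload-status": {"current": "active", "message": "Primary"}}
            },
        },
    }
}

for name, current, message in not_active(sample_status):
    print(name, current, message)
```

With real data, the dictionary would come from `json.loads()` applied to the command output rather than from a hand-built sample.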
It got restarted automatically and came up properly
@shayancanonical maybe you should also take a look at this one
After a server reboot, juju status showed multiple units in the "agent lost, see 'juju show-status-log" state. I restarted the pods for the units in the agent lost state, including dex. I then tried logging in to the Kubeflow UI again, and this time I got past the initial setup screens.
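The restart step described here can be sketched in Python as follows. This is only an illustration under two assumptions: that the JSON status reports the "agent lost" text in each unit's `workload-status` message, and that a unit such as `katib-db-manager/0` maps to a pod named `katib-db-manager-0`, as it did for the pod deleted earlier in this thread. That mapping may not hold for every charm, so check with `kubectl get pods -n kubeflow` first. All function names are hypothetical, and the sketch only builds the commands rather than running them.

```python
from typing import Dict, List


def lost_units(status: Dict) -> List[str]:
    """Names of units whose workload message starts with "agent lost"."""
    lost = []
    for app in status.get("applications", {}).values():
        for unit_name, unit in (app.get("units") or {}).items():
            message = unit.get("workload-status", {}).get("message", "")
            if message.startswith("agent lost"):
                lost.append(unit_name)
    return lost


def pod_name(unit_name: str) -> str:
    """Assumed mapping: unit "app/N" -> pod "app-N"."""
    app, _, index = unit_name.rpartition("/")
    return f"{app}-{index}"


def delete_commands(units: List[str], namespace: str = "kubeflow") -> List[List[str]]:
    """Build one `kubectl delete pod` argument list per unit."""
    return [["kubectl", "delete", "pod", pod_name(u), "-n", namespace] for u in units]
```

Each returned list can be reviewed and then passed to `subprocess.run(cmd, check=True)`. Keeping the build step separate from execution makes it possible to inspect exactly which pods would be deleted before anything is touched.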
Bug Description
As shown in the juju status output, the katib-db-manager unit is stuck with hook failed: "update-status".
To Reproduce
Environment
Relevant Log Output
logs_2.zip
Additional Context
No response