Closed: NohaIhab closed this issue 3 months ago.
From some quick research, it seems like these kinds of issues are associated with the katib-db-manager not being up and ready when running experiments. Do you know if it was in fact active and idle, and the pods were also up and ready when you tried running the experiment?
Similar issue: https://github.com/kubeflow/katib/issues/1517
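For reference, a hedged sketch of how to check both things (the pod name follows the charm's usual <app>-0 convention used elsewhere in this thread):
# Charm should report active/idle
juju status katib-db-manager
# Workload pod should be Running and Ready
kubectl -n kubeflow get pod katib-db-manager-0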
Could not deploy Katib from latest/edge. Will verify in the complete KF deployment.
Model Controller Cloud/Region Version SLA Timestamp
kubeflow microk8s-localhost microk8s/localhost 2.9.44 unsupported 16:32:59-04:00
App Version Status Scale Charm Channel Rev Address Exposed Message
katib-controller res:oci-image@111495a active 1 katib-controller edge 341 10.152.183.245 no
katib-db mariadb/server:10.3 active 1 mariadb-k8s stable 35 10.152.183.96 no ready
katib-db-manager waiting 1 katib-db-manager edge 309 10.152.183.198 no installing agent
katib-ui waiting 1 katib-ui edge 320 10.152.183.80 no installing agent
Unit Workload Agent Address Ports Message
katib-controller/0* active idle 10.1.59.81 443/TCP,8080/TCP
katib-db-manager/0* error idle 10.1.59.76 hook failed: "install"
katib-db/0* active idle 10.1.59.80 3306/TCP ready
katib-ui/0* blocked idle 10.1.59.78 kubernetes resource creation failed
@DnPlas the katib-db-manager charm is active/idle and the pod is ready. Logs from the katib-db-manager container:
2023-10-04T08:44:54.097Z [pebble] Service "katib-db-manager" starting: ./katib-db-manager
2023-10-04T08:44:54.102Z [katib-db-manager] I1004 08:44:54.102216 43 db.go:32] Using MySQL
2023-10-04T08:44:59.122Z [katib-db-manager] I1004 08:44:59.122380 43 init.go:27] Initializing v1beta1 DB schema
2023-10-04T08:44:59.127Z [katib-db-manager] I1004 08:44:59.127916 43 main.go:113] Start Katib manager: 0.0.0.0:6789
2023-10-04T08:49:25.614Z [pebble] GET /v1/plan?format=yaml 235.252µs 200
2023-10-04T08:54:27.665Z [pebble] GET /v1/plan?format=yaml 149.054µs 200
2023-10-04T08:59:30.039Z [pebble] GET /v1/plan?format=yaml 543.314µs 200
2023-10-04T09:05:06.070Z [pebble] GET /v1/plan?format=yaml 265.883µs 200
2023-10-04T09:09:56.200Z [pebble] GET /v1/plan?format=yaml 172.92µs 200
2023-10-04T09:15:28.309Z [pebble] GET /v1/plan?format=yaml 152.975µs 200
2023-10-04T09:20:14.027Z [pebble] GET /v1/plan?format=yaml 193.914µs 200
2023-10-04T09:26:01.311Z [pebble] GET /v1/plan?format=yaml 195.693µs 200
2023-10-04T09:31:34.287Z [pebble] GET /v1/plan?format=yaml 289.235µs 200
2023-10-04T09:35:34.871Z [pebble] GET /v1/plan?format=yaml 204.887µs 200
2023-10-04T09:40:25.212Z [pebble] GET /v1/plan?format=yaml 337.025µs 200
2023-10-04T09:46:19.174Z [pebble] GET /v1/plan?format=yaml 749.551µs 200
2023-10-04T09:51:20.886Z [pebble] GET /v1/plan?format=yaml 160.222µs 200
2023-10-04T09:57:04.651Z [pebble] GET /v1/plan?format=yaml 201.391µs 200
2023-10-04T10:02:33.420Z [pebble] GET /v1/plan?format=yaml 181.847µs 200
2023-10-04T10:08:26.583Z [pebble] GET /v1/plan?format=yaml 135.547µs 200
2023-10-04T10:13:08.665Z [pebble] GET /v1/plan?format=yaml 154.588µs 200
2023-10-04T10:18:45.564Z [pebble] GET /v1/plan?format=yaml 185.666µs 200
2023-10-04T10:24:19.338Z [pebble] GET /v1/plan?format=yaml 180.573µs 200
2023-10-04T10:28:55.284Z [pebble] GET /v1/plan?format=yaml 139.591µs 200
2023-10-04T10:33:46.465Z [pebble] GET /v1/plan?format=yaml 155.926µs 200
2023-10-04T10:38:54.247Z [pebble] GET /v1/plan?format=yaml 204.563µs 200
2023-10-04T10:44:02.575Z [pebble] GET /v1/plan?format=yaml 193.604µs 200
2023-10-04T10:49:00.315Z [pebble] GET /v1/plan?format=yaml 233.411µs 200
2023-10-04T10:53:06.184Z [pebble] GET /v1/plan?format=yaml 188.678µs 200
2023-10-04T10:58:02.907Z [pebble] GET /v1/plan?format=yaml 172.183µs 200
2023-10-04T11:02:47.260Z [pebble] GET /v1/plan?format=yaml 392.891µs 200
2023-10-04T11:07:02.473Z [pebble] GET /v1/plan?format=yaml 190.738µs 200
2023-10-04T11:11:08.045Z [pebble] GET /v1/plan?format=yaml 196.818µs 200
2023-10-04T11:16:59.296Z [pebble] GET /v1/plan?format=yaml 161.081µs 200
2023-10-04T11:21:03.021Z [pebble] GET /v1/plan?format=yaml 223.954µs 200
2023-10-04T11:25:12.536Z [pebble] GET /v1/plan?format=yaml 207.933µs 200
2023-10-04T11:29:30.744Z [pebble] GET /v1/plan?format=yaml 178.44µs 200
2023-10-04T11:33:54.612Z [pebble] GET /v1/plan?format=yaml 186.62µs 200
2023-10-04T11:39:29.139Z [pebble] GET /v1/plan?format=yaml 319.879µs 200
2023-10-04T11:44:26.934Z [pebble] GET /v1/plan?format=yaml 198.789µs 200
2023-10-04T11:49:00.185Z [pebble] GET /v1/plan?format=yaml 453.48µs 200
2023-10-04T11:54:35.465Z [pebble] GET /v1/plan?format=yaml 202.096µs 200
2023-10-04T11:58:45.953Z [pebble] GET /v1/plan?format=yaml 185.96µs 200
2023-10-04T12:04:30.993Z [pebble] GET /v1/plan?format=yaml 228.332µs 200
2023-10-04T12:10:28.627Z [pebble] GET /v1/plan?format=yaml 16.24712ms 200
2023-10-04T12:16:04.205Z [pebble] GET /v1/plan?format=yaml 261.961µs 200
2023-10-04T12:21:57.655Z [pebble] GET /v1/plan?format=yaml 343.916µs 200
I was not able to reproduce this issue; we should re-open it if we hit it again and document the steps to reproduce.
Came across this again when running the Katib UAT notebook. I then also tested with the grid example and the same thing happened there as well. Note, though, that this happened after I had scaled the cluster down to 0 nodes yesterday before EOD and scaled it up again today.
Deployed the latest/edge bundle to EKS 1.25 using Juju 3.1 (following this guide).
I redeployed to a new cluster with the same setup and the issue didn't come up there either.
This is the last line in the logs of every trial pod spun up by the Katib experiment.
F1122 09:45:46.259588 21 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.8.221:65535: i/o timeout"
As @DnPlas pointed out, this is the pod trying to contact katib-db-manager (10.100.8.221). However, katib-db-manager is up and running. From the logs below, we see that katib-db-manager has a hard time talking to the database itself.
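A hedged way to cross-check the path the trial pods are dialing is to inspect the katib-db-manager Service and its Endpoints (a sketch; names assume the default kubeflow model namespace used above):
# The Service should expose the manager's gRPC port (6789, per the startup log above)
kubectl -n kubeflow get svc katib-db-manager
# The Endpoints should list a ready katib-db-manager pod behind that Service
kubectl -n kubeflow get endpoints katib-db-manager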
Note that the following errors are present even in the healthy cluster:
2023-11-22T15:33:40.738Z [container-agent] 2023-11-22 15:33:40 ERROR juju.worker.dependency engine.go:695 "log-sender" manifold worker returned unexpected error: sending log message: websocket: close 1006 (abnormal closure): unexpected EOF: use of closed network connection
I also noticed katib-db going to Maintenance state with the message offline. It eventually went back to active by itself. Looking at its logs:
Here are also the full Kubernetes logs. I'm not sure what we can conclude from the above; this could be a mysql-k8s charm issue, but we can't be sure.
I have also faced a similar issue with Katib. These are the last log outputs from the Katib UI trials:
I1123 08:46:42.370564 14 main.go:139] 2023-11-23T08:46:42Z INFO loss=0.49476566910743713
I1123 08:46:42.370576 14 main.go:139] 2023-11-23T08:46:42Z INFO categorical_accuracy=0.8163449764251709
I1123 08:46:42.370581 14 main.go:139] 2023-11-23T08:46:42Z INFO same_precision=0.6819637417793274
I1123 08:46:42.370621 14 main.go:139] 2023-11-23T08:46:42Z INFO same_recall=0.7400115728378296
I1123 08:46:42.370631 14 main.go:139] 2023-11-23T08:46:42Z INFO val_loss=0.4788762629032135
I1123 08:46:42.370650 14 main.go:139] 2023-11-23T08:46:42Z INFO val_categorical_accuracy=0.8287093043327332
I1123 08:46:42.370728 14 main.go:139] 2023-11-23T08:46:42Z INFO val_same_precision=0.6882216930389404
I1123 08:46:42.370845 14 main.go:139] 2023-11-23T08:46:42Z INFO val_same_recall=0.7760416865348816
F1123 08:47:06.643975 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.152.183.250:65535: i/o timeout"
@Daard Could you let us know a bit more about the environment and deployment you had during the above error?
Environment
Used this guide to deploy charmed kubeflow.
Logs
Added logs from the trial already.
What I have done
I created a custom TFJob which runs and completes successfully, but it does not work as a Katib experiment.
I have also tested several other experiment configurations from the kubeflow/katib documentation, but they show the same behaviour.
@orfeas-k Do you need some additional logs for understanding the reason?
@orfeas-k After deleting the Katib experiment, the trials remain in my namespace and I can't delete them, even after deleting all resources connected to the experiment (pods, experiments, suggestions).
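As a general, hedged workaround for Trial objects stuck on deletion (not verified for this exact case): leftover trials are usually held back by a finalizer, which can be listed and cleared before deleting them, e.g.:
# List leftover Trial custom resources
kubectl -n my-namespace get trials.kubeflow.org
# Clear the finalizers on a stuck trial (replace <trial-name>), then delete it
kubectl -n my-namespace patch trials.kubeflow.org <trial-name> --type=merge -p '{"metadata":{"finalizers":[]}}'
kubectl -n my-namespace delete trials.kubeflow.org <trial-name>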
Could you post all logs from the katib-db-manager-0 and katib-db-0 pods? You can use:
kubectl -nkubeflow logs katib-db-manager-0 --all-containers
kubectl -nkubeflow logs katib-db-0 --all-containers
@orfeas-k Sure.
Thank you @Daard. We would like to understand better who exactly is trying to contact katib-db-manager and fails. I think the trial pod has a metrics-collector container as well, and we would like to see if that is the one sending the request that times out. Could you post the logs from the trial pod using --all-containers?
@orfeas-k When I tried to get logs from kubectl I got this:
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d15h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d16h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d16h ago) 7d23h
resell-lab-0 2/2 Running 0 20h
exp-resell-dd-grid-6b5bc56d94-tdt8k 1/1 Running 0 27m
exp-resell-dd-qp9fnnzt-worker-0 2/2 Running 0 2m12s
exp-resell-dd-dmnkgt8f-worker-0 1/2 NotReady 0 2m14s
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs exp-resell-dd-dmnkgt8f-worker-0 > worker.txt
Defaulted container "tensorflow" out of: tensorflow, metrics-logger-and-collector
But I can see logs from the TFJob and they are similar to the logs from the Katib trial UI:
I1123 10:53:59.433666 14 main.go:396] Trial Name: exp-resell-dd-qp9fnnzt
I1123 10:54:02.806239 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature label has a shape dim {
I1123 10:54:02.806274 14 main.go:139] size: 1
I1123 10:54:02.806314 14 main.go:139] }
I1123 10:54:02.806333 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806387 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_agency has a shape dim {
I1123 10:54:02.806405 14 main.go:139] size: 1
I1123 10:54:02.806419 14 main.go:139] }
I1123 10:54:02.806425 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806438 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_area_kitchen has a shape dim {
I1123 10:54:02.806450 14 main.go:139] size: 1
I1123 10:54:02.806459 14 main.go:139] }
I1123 10:54:02.806465 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806474 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_area_living has a shape dim {
I1123 10:54:02.806489 14 main.go:139] size: 1
I1123 10:54:02.806499 14 main.go:139] }
I1123 10:54:02.806505 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806513 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_area_total has a shape dim {
I1123 10:54:02.806518 14 main.go:139] size: 1
I1123 10:54:02.806525 14 main.go:139] }
I1123 10:54:02.806541 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806557 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_building has a shape dim {
I1123 10:54:02.806567 14 main.go:139] size: 1
I1123 10:54:02.806580 14 main.go:139] }
I1123 10:54:02.806584 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806728 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_built_year has a shape dim {
I1123 10:54:02.806747 14 main.go:139] size: 1
I1123 10:54:02.806762 14 main.go:139] }
I1123 10:54:02.806775 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:20.135699 14 main.go:139] /usr/local/lib/python3.8/dist-packages/keras/src/engine/functional.py:639: UserWarning: Input dict contained keys ['left_page_id', 'left_site', 'right_page_id', 'right_site'] which did not match any model input. They will be ignored by the model.
I1123 10:54:20.135737 14 main.go:139] inputs = self._flatten_to_reference_inputs(inputs)
I1123 10:54:20.135760 14 main.go:139] WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0300s vs `on_train_batch_end` time: 0.0376s). Check your callbacks.
I1123 10:54:23.616630 14 main.go:139] 2023-11-23T10:54:23Z INFO epoch 1:
I1123 10:55:22.620524 14 main.go:139] 2023-11-23T10:55:22Z INFO categorical_accuracy=0.8045838475227356
I1123 10:55:22.620544 14 main.go:139] 2023-11-23T10:55:22Z INFO same_precision=0.6760939359664917
I1123 10:55:22.620563 14 main.go:139] 2023-11-23T10:55:22Z INFO same_recall=0.727324903011322
I1123 10:55:22.620576 14 main.go:139] 2023-11-23T10:55:22Z INFO val_loss=0.5009434223175049
I1123 10:55:22.620695 14 main.go:139] 2023-11-23T10:55:22Z INFO val_categorical_accuracy=0.8238841891288757
I1123 10:55:22.620801 14 main.go:139] 2023-11-23T10:55:22Z INFO val_same_precision=0.607390284538269
I1123 10:55:22.620865 14 main.go:139] 2023-11-23T10:55:22Z INFO val_same_recall=0.7758111953735352
I1123 10:55:24.845616 14 main.go:139] 2023-11-23T10:55:24Z INFO epoch 27:
I1123 10:55:24.845646 14 main.go:139] 2023-11-23T10:55:24Z INFO loss=0.519189178943634
I1123 10:55:24.845659 14 main.go:139] 2023-11-23T10:55:24Z INFO categorical_accuracy=0.8094089031219482
I1123 10:55:24.845672 14 main.go:139] 2023-11-23T10:55:24Z INFO same_precision=0.6472785472869873
I1123 10:55:24.845743 14 main.go:139] 2023-11-23T10:55:24Z INFO same_recall=0.7418960332870483
I1123 10:55:24.845861 14 main.go:139] 2023-11-23T10:55:24Z INFO val_loss=0.49683240056037903
I1123 10:55:24.845900 14 main.go:139] 2023-11-23T10:55:24Z INFO val_categorical_accuracy=0.8262967467308044
I1123 10:55:24.845914 14 main.go:139] 2023-11-23T10:55:24Z INFO val_same_precision=0.6697459816932678
I1123 10:55:24.846021 14 main.go:139] 2023-11-23T10:55:24Z INFO val_same_recall=0.7651715278625488
I1123 10:55:27.488730 14 main.go:139] 2023-11-23T10:55:27Z INFO epoch 28:
I1123 10:55:27.488755 14 main.go:139] 2023-11-23T10:55:27Z INFO loss=0.517208993434906
I1123 10:55:27.488795 14 main.go:139] 2023-11-23T10:55:27Z INFO categorical_accuracy=0.8075995445251465
I1123 10:55:27.488806 14 main.go:139] 2023-11-23T10:55:27Z INFO same_precision=0.6600853800773621
I1123 10:55:27.488857 14 main.go:139] 2023-11-23T10:55:27Z INFO same_recall=0.7438364624977112
I1123 10:55:27.488940 14 main.go:139] 2023-11-23T10:55:27Z INFO val_loss=0.49721571803092957
I1123 10:55:27.489047 14 main.go:139] 2023-11-23T10:55:27Z INFO val_categorical_accuracy=0.8250904679298401
I1123 10:55:27.489056 14 main.go:139] 2023-11-23T10:55:27Z INFO val_same_precision=0.6697459816932678
I1123 10:55:27.489163 14 main.go:139] 2023-11-23T10:55:27Z INFO val_same_recall=0.7591623067855835
I1123 10:55:29.691004 14 main.go:139] 2023-11-23T10:55:29Z INFO epoch 29:
I1123 10:55:29.691056 14 main.go:139] 2023-11-23T10:55:29Z INFO loss=0.5149072408676147
I1123 10:55:29.691077 14 main.go:139] 2023-11-23T10:55:29Z INFO categorical_accuracy=0.8060916662216187
I1123 10:55:29.691171 14 main.go:139] 2023-11-23T10:55:29Z INFO same_precision=0.6648879647254944
I1123 10:55:29.691262 14 main.go:139] 2023-11-23T10:55:29Z INFO same_recall=0.7316500544548035
I1123 10:55:29.691357 14 main.go:139] 2023-11-23T10:55:29Z INFO val_loss=0.49219810962677
Hmm, could you try to rerun a Katib experiment and provide logs using --all-containers while the pod is up and in an error state? IIRC, once the Katib experiment has completed running, the trial's pod is deleted, so you can no longer see the pod in your namespace.
Do I understand correctly that the trial pod is a separate pod, and not the worker pod that I built and that reports these metrics?
I1123 10:54:23.616630 14 main.go:139] 2023-11-23T10:54:23Z INFO epoch 1:
I1123 10:54:23.616663 14 main.go:139] 2023-11-23T10:54:23Z INFO loss=1.7035995721817017
I1123 10:54:23.616686 14 main.go:139] 2023-11-23T10:54:23Z INFO categorical_accuracy=0.5815742015838623
I1123 10:54:23.616696 14 main.go:139] 2023-11-23T10:54:23Z INFO same_precision=0.46905016899108887
I1123 10:54:23.616722 14 main.go:139] 2023-11-23T10:54:23Z INFO same_recall=0.40713292360305786
I1123 10:54:23.616879 14 main.go:139] 2023-11-23T10:54:23Z INFO val_loss=99.96876525878906
I1123 10:54:23.616904 14 main.go:139] 2023-11-23T10:54:23Z INFO val_categorical_accuracy=0.6779252290725708
I1123 10:54:23.617044 14 main.go:139] 2023-11-23T10:54:23Z INFO val_same_precision=0.18244802951812744
I1123 10:54:23.617147 14 main.go:139] 2023-11-23T10:54:23Z INFO val_same_recall=0.585185170173645
Because while the experiment is running there are only these pods:
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d15h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d16h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d16h ago) 7d23h
resell-lab-0 2/2 Running 0 20h
exp-resell-dd-grid-6b5bc56d94-tdt8k 1/1 Running 0 27m
exp-resell-dd-qp9fnnzt-worker-0 2/2 Running 0 2m12s
exp-resell-dd-dmnkgt8f-worker-0 1/2 NotReady 0 2m14s
There are also trials like this:
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs trial.kubeflow.org/exp-resell-dd-qp9fnnzt
error: no kind "Trial" is registered for version "kubeflow.org/v1beta1" in scheme "pkg/scheme/scheme.go:28"
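That error is expected from kubectl logs, which only works on pods; to inspect the Trial custom resource itself, something like the following should work (a sketch using the trial name from above):
kubectl -n my-namespace get trials.kubeflow.org
kubectl -n my-namespace describe trials.kubeflow.org exp-resell-dd-qp9fnnzt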
@Daard I'll get back to your previous comment soon. In the meantime, could you post the output of the juju status katib-db command?
juju status katib-db
Model Controller Cloud/Region Version SLA Timestamp
kubeflow my-controller myk8s/localhost 3.1.6 unsupported 13:10:12Z
App Version Status Scale Charm Channel Rev Address Exposed Message
katib-db 8.0.34-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 99 10.152.183.35 no
Unit Workload Agent Address Ports Message
katib-db/0* active idle 10.1.183.72 Primary
Thank you for your effort in debugging this @Daard. The trial pods are pods spun up by the experiment while it's running. IIUC, in your case the experiment pod is exp-resell-dd-grid-6b5bc56d94-tdt8k and the trial pods are:
exp-resell-dd-qp9fnnzt-worker-0 2/2 Running 0 2m12s
exp-resell-dd-dmnkgt8f-worker-0 1/2 NotReady 0 2m14s
Are those completing successfully or going into Error status? We want the logs from those once they are in the Error state.
I will restart the experiment soon, but it needs 20-30 minutes to start my workers. Is that normal behaviour, by the way? The TFJob starts almost instantly.
After the trial fails my workers are gone, and in the UI I see this message:
Failed to find logs for this Trial.
Make sure you've set "spec.trialTemplate.retain" field to "true" in the Experiment definition.
If this error persists then the Pod's logs are not currently persisted in the cluster.
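For context, retaining trial pods is configured in the Experiment definition itself; a minimal sketch of the relevant fragment (experiment name and the rest of the trialTemplate are placeholders) might look like:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: <experiment-name>
  namespace: my-namespace
spec:
  trialTemplate:
    retain: true    # keep trial pods after completion/failure so their logs stay accessible
    # ... rest of the trialTemplate unchanged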
The yaml output says this:
- type: Failed
status: 'True'
reason: 'TrialFailed. Job reason: TFJobFailed'
message: >-
Trial has failed. Job message: TFJob my-namespace/exp-resell-dd-dmnkgt8f
has failed because 1 Worker replica(s) failed.
lastUpdateTime: '2023-11-23T11:02:53Z'
lastTransitionTime: '2023-11-23T11:02:53Z'
I will try to increase the replica count. Maybe that will help to get error logs.
I have got some logs:
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d19h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d19h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d19h ago) 8d
exp-resell-ee-grid-667cfd4548-btbnh 1/1 Running 0 24m
exp-resell-ee-pdrnc575-worker-1 0/1 Completed 0 3m3s
exp-resell-ee-2pxhvm8q-worker-1 1/1 Running 0 2m59s
exp-resell-ee-pdrnc575-worker-0 0/2 Error 2 (28s ago) 3m4s
exp-resell-ee-2pxhvm8q-worker-0 0/2 Error 3 (30s ago) 3m1s
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs exp-resell-ee-pdrnc575-worker-0
Defaulted container "tensorflow" out of: tensorflow, metrics-logger-and-collector
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d19h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d19h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d19h ago) 8d
exp-resell-ee-grid-667cfd4548-btbnh 1/1 Running 0 25m
exp-resell-ee-pdrnc575-worker-1 0/1 Completed 0 3m47s
exp-resell-ee-2pxhvm8q-worker-1 1/1 Running 0 3m43s
exp-resell-ee-pdrnc575-worker-0 0/2 CrashLoopBackOff 3 (30s ago) 3m48s
exp-resell-ee-2pxhvm8q-worker-0 0/2 Error 4 (48s ago) 3m45s
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs exp-resell-ee-pdrnc575-worker-0
Defaulted container "tensorflow" out of: tensorflow, metrics-logger-and-collector
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d19h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d19h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d19h ago) 8d
exp-resell-ee-grid-667cfd4548-btbnh 1/1 Running 0 28m
exp-resell-ee-pdrnc575-worker-1 0/1 Completed 0 7m36s
exp-resell-ee-2pxhvm8q-worker-1 0/1 Completed 0 7m32s
exp-resell-ee-2pxhvm8q-worker-0 0/2 CrashLoopBackOff 5 (2m25s ago) 7m34s
exp-resell-ee-pdrnc575-worker-0 0/2 CrashLoopBackOff 5 (2m16s ago) 7m37s
After I increased the replica count, some trials completed. Their output is similar to the TFJob log, but in the UI I got this output at the end:
I1123 14:38:53.968930 5790 main.go:139] 2023-11-23T14:32:41Z INFO epoch 31:
I1123 14:38:53.968937 5790 main.go:139] 2023-11-23T14:32:41Z INFO loss=0.5217770934104919
I1123 14:38:53.968944 5790 main.go:139] 2023-11-23T14:32:41Z INFO categorical_accuracy=0.8083534240722656
I1123 14:38:53.968949 5790 main.go:139] 2023-11-23T14:32:41Z INFO same_precision=0.6734258532524109
I1123 14:38:53.968956 5790 main.go:139] 2023-11-23T14:32:41Z INFO same_recall=0.7432273030281067
I1123 14:38:53.968963 5790 main.go:139] 2023-11-23T14:32:41Z INFO val_loss=0.5091827511787415
I1123 14:38:53.968975 5790 main.go:139] 2023-11-23T14:32:41Z INFO val_categorical_accuracy=0.8226779103279114
I1123 14:38:53.968989 5790 main.go:139] 2023-11-23T14:32:41Z INFO val_same_precision=0.6974595785140991
F1123 14:38:53.968994 5790 main.go:421] Failed to wait for worker container: unable to find main pid from the process list [1 5790]
I1123 14:38:53.969127 5790 main.go:139] 2023-11-23T14:32:41Z INFO val_same_recall=0.7512437701225281
I1123 14:38:53.969137 5790 main.go:139] 2023-11-23T14:32:43Z INFO epoch 32:
Now the experiment is stuck.
What we need are the logs from pods that are in Error state using --all-containers, i.e. running kubectl -n my-namespace logs pod-name --all-containers.
I did not catch the Error state of the worker, only CrashLoopBackOff. Is that crucial, or can you work with these logs?
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d20h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d21h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d21h ago) 8d
exp-resell-ff-grid-77955875cb-sl7j4 1/1 Running 0 40m
resell-lab-0 2/2 Running 0 16m
exp-resell-ff-5crjfw69-worker-1 0/1 Pending 0 10m
exp-resell-ff-cc4mjp6j-worker-1 0/1 Completed 0 10m
exp-resell-ff-cc4mjp6j-worker-0 0/2 CrashLoopBackOff 6 (2m52s ago) 11m
exp-resell-ff-5crjfw69-worker-0 0/2 CrashLoopBackOff 6 (22s ago) 10m
kubectl -n my-namespace logs exp-resell-ff-cc4mjp6j-worker-0 --all-containers > error.logs
error.txt
@orfeas-k Hello again. I am not familiar with k8s and Kubeflow, so I did not realise that adding the --all-containers arg would make a difference. I have added all logs from the error containers in the err2.txt file. I hope it will help to debug Katib.
I have faced a similar issue with this example. The trial pods are now stuck in Pending state. I hope you will find the problem; if you need any additional logs I can easily send them.
@Daard could you try removing the katib-db charm and redeploying it with the following commands? Let's see if this will unblock new experiments. Please note that this will delete any data related to Katib experiments you've already run, so avoid doing this if you have important data from past experiments.
juju remove-application katib-db
# Wait for it to be removed
juju deploy mysql-k8s katib-db --channel 8.0/stable --trust --constraints mem=2G
juju relate katib-db-manager:relational-db katib-db:database
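After redeploying, a hedged way to confirm things have settled is to wait until both applications report active/idle, e.g.:
juju status katib-db katib-db-manager
# or keep polling until both units go active/idle:
watch -n 5 juju status katib-db katib-db-manager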
@orfeas-k Is this normal? juju_status.txt
The logs from katib-db-manager:
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken data = self._lazy_data = self._load()
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 1378, in _load
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken return self._backend.relation_get(self.relation.id, self._entity.name, self._is_app)
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 2697, in relation_get
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken raw_data_content = self._run(*args, return_output=True, use_json=True)
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 2618, in _run
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken raise ModelError(e.stderr)
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken ops.model.ModelError: ERROR permission denied
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken
2023-11-24T13:05:31.810Z [container-agent] 2023-11-24 13:05:31 ERROR juju.worker.uniter.operation runhook.go:180 hook "relational-db-relation-broken" (via hook dispatching script: dispatch) failed: exit status 1
2023-11-24T13:05:31.812Z [container-agent] 2023-11-24 13:05:31 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:05:37.228Z [container-agent] 2023-11-24 13:05:37 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:05:37.228Z [container-agent] 2023-11-24 13:05:37 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:05:37.299Z [container-agent] 2023-11-24 13:05:37 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:09:27.243Z [container-agent] 2023-11-24 13:09:27 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
Filed an issue in mysql-k8s: https://github.com/canonical/mysql-k8s-operator/issues/341
Received a response (https://github.com/canonical/mysql-k8s-operator/issues/341#issuecomment-1836519425) mentioning that mysql-k8s has an issue scaling up after it was scaled to 0 units, and that they're working on a solution. If the root cause of the Katib issue is katib-db not being responsive to katib-db-manager, then it will be resolved once they push a fix.
In the end, this appears to be the same issue we hit in https://github.com/canonical/bundle-kubeflow/issues/893, described in detail in https://github.com/canonical/bundle-kubeflow/issues/893#issuecomment-2142022677.
Thank you for reporting us your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5873.
This message was autogenerated
Hit this issue while defining tests for CKF 1.8/stable in an air-gapped environment (https://github.com/canonical/bundle-kubeflow/issues/918); PR #192 should resolve it.
Steps to reproduce
Result: pods go to Error at the final stage of reporting results, with the following logs:
10.152.183.167 is the IP of the katib-db-manager ClusterIP Service. The same logs can be viewed from the katib-controller container logs.
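A hedged way to follow up on this is to confirm the Service the metrics collector is dialing and pull the controller-side logs (the pod name assumes the charm's usual <app>-0 convention, as with the other Katib pods above):
# Confirm the katib-db-manager ClusterIP and exposed port
kubectl -n kubeflow get svc katib-db-manager
# View the same reporting errors from the katib-controller side
kubectl -n kubeflow logs katib-controller-0 --all-containers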