Closed: NohaIhab closed this issue 3 months ago.
From some quick research, it seems like these kinds of issues are associated with the katib-db-manager not being up and ready when running experiments. Do you know if it was in fact active and idle, and the pods were also up and ready when you tried running the experiment?
Similar issue: https://github.com/kubeflow/katib/issues/1517
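For reference, a hedged sketch of how to check both things (the pod name follows the charm's usual <app>-0 convention used elsewhere in this thread):
# Charm should report active/idle
juju status katib-db-manager
# Workload pod should be Running and Ready
kubectl -n kubeflow get pod katib-db-manager-0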
Could not deploy Katib from latest/edge. Will verify in the complete KF deployment.
Model Controller Cloud/Region Version SLA Timestamp
kubeflow microk8s-localhost microk8s/localhost 2.9.44 unsupported 16:32:59-04:00
App Version Status Scale Charm Channel Rev Address Exposed Message
katib-controller res:oci-image@111495a active 1 katib-controller edge 341 10.152.183.245 no
katib-db mariadb/server:10.3 active 1 mariadb-k8s stable 35 10.152.183.96 no ready
katib-db-manager waiting 1 katib-db-manager edge 309 10.152.183.198 no installing agent
katib-ui waiting 1 katib-ui edge 320 10.152.183.80 no installing agent
Unit Workload Agent Address Ports Message
katib-controller/0* active idle 10.1.59.81 443/TCP,8080/TCP
katib-db-manager/0* error idle 10.1.59.76 hook failed: "install"
katib-db/0* active idle 10.1.59.80 3306/TCP ready
katib-ui/0* blocked idle 10.1.59.78 kubernetes resource creation failed
@DnPlas the katib-db-manager charm is active/idle and the pod is ready. Logs from the katib-db-manager container:
2023-10-04T08:44:54.097Z [pebble] Service "katib-db-manager" starting: ./katib-db-manager
2023-10-04T08:44:54.102Z [katib-db-manager] I1004 08:44:54.102216 43 db.go:32] Using MySQL
2023-10-04T08:44:59.122Z [katib-db-manager] I1004 08:44:59.122380 43 init.go:27] Initializing v1beta1 DB schema
2023-10-04T08:44:59.127Z [katib-db-manager] I1004 08:44:59.127916 43 main.go:113] Start Katib manager: 0.0.0.0:6789
2023-10-04T08:49:25.614Z [pebble] GET /v1/plan?format=yaml 235.252µs 200
2023-10-04T08:54:27.665Z [pebble] GET /v1/plan?format=yaml 149.054µs 200
2023-10-04T08:59:30.039Z [pebble] GET /v1/plan?format=yaml 543.314µs 200
2023-10-04T09:05:06.070Z [pebble] GET /v1/plan?format=yaml 265.883µs 200
2023-10-04T09:09:56.200Z [pebble] GET /v1/plan?format=yaml 172.92µs 200
2023-10-04T09:15:28.309Z [pebble] GET /v1/plan?format=yaml 152.975µs 200
2023-10-04T09:20:14.027Z [pebble] GET /v1/plan?format=yaml 193.914µs 200
2023-10-04T09:26:01.311Z [pebble] GET /v1/plan?format=yaml 195.693µs 200
2023-10-04T09:31:34.287Z [pebble] GET /v1/plan?format=yaml 289.235µs 200
2023-10-04T09:35:34.871Z [pebble] GET /v1/plan?format=yaml 204.887µs 200
2023-10-04T09:40:25.212Z [pebble] GET /v1/plan?format=yaml 337.025µs 200
2023-10-04T09:46:19.174Z [pebble] GET /v1/plan?format=yaml 749.551µs 200
2023-10-04T09:51:20.886Z [pebble] GET /v1/plan?format=yaml 160.222µs 200
2023-10-04T09:57:04.651Z [pebble] GET /v1/plan?format=yaml 201.391µs 200
2023-10-04T10:02:33.420Z [pebble] GET /v1/plan?format=yaml 181.847µs 200
2023-10-04T10:08:26.583Z [pebble] GET /v1/plan?format=yaml 135.547µs 200
2023-10-04T10:13:08.665Z [pebble] GET /v1/plan?format=yaml 154.588µs 200
2023-10-04T10:18:45.564Z [pebble] GET /v1/plan?format=yaml 185.666µs 200
2023-10-04T10:24:19.338Z [pebble] GET /v1/plan?format=yaml 180.573µs 200
2023-10-04T10:28:55.284Z [pebble] GET /v1/plan?format=yaml 139.591µs 200
2023-10-04T10:33:46.465Z [pebble] GET /v1/plan?format=yaml 155.926µs 200
2023-10-04T10:38:54.247Z [pebble] GET /v1/plan?format=yaml 204.563µs 200
2023-10-04T10:44:02.575Z [pebble] GET /v1/plan?format=yaml 193.604µs 200
2023-10-04T10:49:00.315Z [pebble] GET /v1/plan?format=yaml 233.411µs 200
2023-10-04T10:53:06.184Z [pebble] GET /v1/plan?format=yaml 188.678µs 200
2023-10-04T10:58:02.907Z [pebble] GET /v1/plan?format=yaml 172.183µs 200
2023-10-04T11:02:47.260Z [pebble] GET /v1/plan?format=yaml 392.891µs 200
2023-10-04T11:07:02.473Z [pebble] GET /v1/plan?format=yaml 190.738µs 200
2023-10-04T11:11:08.045Z [pebble] GET /v1/plan?format=yaml 196.818µs 200
2023-10-04T11:16:59.296Z [pebble] GET /v1/plan?format=yaml 161.081µs 200
2023-10-04T11:21:03.021Z [pebble] GET /v1/plan?format=yaml 223.954µs 200
2023-10-04T11:25:12.536Z [pebble] GET /v1/plan?format=yaml 207.933µs 200
2023-10-04T11:29:30.744Z [pebble] GET /v1/plan?format=yaml 178.44µs 200
2023-10-04T11:33:54.612Z [pebble] GET /v1/plan?format=yaml 186.62µs 200
2023-10-04T11:39:29.139Z [pebble] GET /v1/plan?format=yaml 319.879µs 200
2023-10-04T11:44:26.934Z [pebble] GET /v1/plan?format=yaml 198.789µs 200
2023-10-04T11:49:00.185Z [pebble] GET /v1/plan?format=yaml 453.48µs 200
2023-10-04T11:54:35.465Z [pebble] GET /v1/plan?format=yaml 202.096µs 200
2023-10-04T11:58:45.953Z [pebble] GET /v1/plan?format=yaml 185.96µs 200
2023-10-04T12:04:30.993Z [pebble] GET /v1/plan?format=yaml 228.332µs 200
2023-10-04T12:10:28.627Z [pebble] GET /v1/plan?format=yaml 16.24712ms 200
2023-10-04T12:16:04.205Z [pebble] GET /v1/plan?format=yaml 261.961µs 200
2023-10-04T12:21:57.655Z [pebble] GET /v1/plan?format=yaml 343.916µs 200
I was not able to reproduce this issue; we should re-open it if we hit it again and document the steps to reproduce.
Came across this again when running the Katib UAT notebook. I then also tested with the grid example and the same thing happened there as well. Note, though, that this happened after I had scaled the cluster down to 0 nodes yesterday before EOD and scaled it up again today.
Deployed the latest/edge bundle to EKS 1.25 using Juju 3.1 (following this guide).
I redeployed to a new cluster with the same setup and the issue didn't come up there either.
This is the last line in the logs of every trial pod spun up by the Katib experiment.
F1122 09:45:46.259588 21 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.100.8.221:65535: i/o timeout"
As @DnPlas pointed out, this is the pod trying to contact katib-db-manager (10.100.8.221). However, katib-db-manager is up and running. From the logs below, we see that katib-db-manager has a hard time talking to the database itself.
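A hedged way to cross-check the path the trial pods are dialing is to inspect the katib-db-manager Service and its Endpoints (a sketch; names assume the default kubeflow model namespace used above):
# The Service should expose the manager's gRPC port (6789, per the startup log above)
kubectl -n kubeflow get svc katib-db-manager
# The Endpoints should list a ready katib-db-manager pod behind that Service
kubectl -n kubeflow get endpoints katib-db-manager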
Note that the following errors are present even in the healthy cluster:
2023-11-22T15:33:40.738Z [container-agent] 2023-11-22 15:33:40 ERROR juju.worker.dependency engine.go:695 "log-sender" manifold worker returned unexpected error: sending log message: websocket: close 1006 (abnormal closure): unexpected EOF: use of closed network connection
I also noticed katib-db going to Maintenance state with the message offline. It eventually went back to active by itself. Looking at its logs:
Here are also the full Kubernetes logs. I'm not sure what we can conclude from the above; this could be a mysql-k8s charm issue, but we can't be sure.
I have also faced a similar issue with Katib. These are the last log outputs from the Katib UI trials:
I1123 08:46:42.370564 14 main.go:139] 2023-11-23T08:46:42Z INFO loss=0.49476566910743713
I1123 08:46:42.370576 14 main.go:139] 2023-11-23T08:46:42Z INFO categorical_accuracy=0.8163449764251709
I1123 08:46:42.370581 14 main.go:139] 2023-11-23T08:46:42Z INFO same_precision=0.6819637417793274
I1123 08:46:42.370621 14 main.go:139] 2023-11-23T08:46:42Z INFO same_recall=0.7400115728378296
I1123 08:46:42.370631 14 main.go:139] 2023-11-23T08:46:42Z INFO val_loss=0.4788762629032135
I1123 08:46:42.370650 14 main.go:139] 2023-11-23T08:46:42Z INFO val_categorical_accuracy=0.8287093043327332
I1123 08:46:42.370728 14 main.go:139] 2023-11-23T08:46:42Z INFO val_same_precision=0.6882216930389404
I1123 08:46:42.370845 14 main.go:139] 2023-11-23T08:46:42Z INFO val_same_recall=0.7760416865348816
F1123 08:47:06.643975 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.152.183.250:65535: i/o timeout"
@Daard Could you let us know a bit more about the environment and deployment you had during the above error?
Environment
Used this guide to deploy charmed kubeflow.
Logs
Added logs from the trial already.
What I have done
I created a custom TFJob which runs and completes successfully, but it does not work as a Katib experiment.
I have also tested several other experiment configurations from the kubeflow/katib documentation, but they show the same behaviour.
@orfeas-k Do you need some additional logs for understanding the reason?
@orfeas-k After deleting the Katib experiment, the trials remain in my namespace and I can't delete them, even after deleting all resources connected to the experiment (pods, experiments, suggestions).
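As a general, hedged workaround for Trial objects stuck on deletion (not verified for this exact case): leftover trials are usually held back by a finalizer, which can be listed and cleared before deleting them, e.g.:
# List leftover Trial custom resources
kubectl -n my-namespace get trials.kubeflow.org
# Clear the finalizers on a stuck trial (replace <trial-name>), then delete it
kubectl -n my-namespace patch trials.kubeflow.org <trial-name> --type=merge -p '{"metadata":{"finalizers":[]}}'
kubectl -n my-namespace delete trials.kubeflow.org <trial-name>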
Could you post all logs from the katib-db-manager-0 and katib-db-0 pods? You can use:
kubectl -nkubeflow logs katib-db-manager-0 --all-containers
kubectl -nkubeflow logs katib-db-0 --all-containers
@orfeas-k Sure.
Thank you @Daard. We would like to understand better who exactly is trying to contact katib-db-manager and fails. I think the trial pod has a metrics-collector container as well, and we would like to see if that is the one sending the request that times out. Could you post the logs from the trial pod using --all-containers?
@orfeas-k When I tried to get logs from kubectl I got this:
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d15h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d16h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d16h ago) 7d23h
resell-lab-0 2/2 Running 0 20h
exp-resell-dd-grid-6b5bc56d94-tdt8k 1/1 Running 0 27m
exp-resell-dd-qp9fnnzt-worker-0 2/2 Running 0 2m12s
exp-resell-dd-dmnkgt8f-worker-0 1/2 NotReady 0 2m14s
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs exp-resell-dd-dmnkgt8f-worker-0 > worker.txt
Defaulted container "tensorflow" out of: tensorflow, metrics-logger-and-collector
But I can see logs from the TFJob and they are similar to the logs from the Katib trial UI:
I1123 10:53:59.433666 14 main.go:396] Trial Name: exp-resell-dd-qp9fnnzt
I1123 10:54:02.806239 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature label has a shape dim {
I1123 10:54:02.806274 14 main.go:139] size: 1
I1123 10:54:02.806314 14 main.go:139] }
I1123 10:54:02.806333 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806387 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_agency has a shape dim {
I1123 10:54:02.806405 14 main.go:139] size: 1
I1123 10:54:02.806419 14 main.go:139] }
I1123 10:54:02.806425 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806438 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_area_kitchen has a shape dim {
I1123 10:54:02.806450 14 main.go:139] size: 1
I1123 10:54:02.806459 14 main.go:139] }
I1123 10:54:02.806465 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806474 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_area_living has a shape dim {
I1123 10:54:02.806489 14 main.go:139] size: 1
I1123 10:54:02.806499 14 main.go:139] }
I1123 10:54:02.806505 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806513 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_area_total has a shape dim {
I1123 10:54:02.806518 14 main.go:139] size: 1
I1123 10:54:02.806525 14 main.go:139] }
I1123 10:54:02.806541 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806557 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_building has a shape dim {
I1123 10:54:02.806567 14 main.go:139] size: 1
I1123 10:54:02.806580 14 main.go:139] }
I1123 10:54:02.806584 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:02.806728 14 main.go:139] 2023-11-23T10:54:02Z INFO Feature left_built_year has a shape dim {
I1123 10:54:02.806747 14 main.go:139] size: 1
I1123 10:54:02.806762 14 main.go:139] }
I1123 10:54:02.806775 14 main.go:139] . Setting to DenseTensor.
I1123 10:54:20.135699 14 main.go:139] /usr/local/lib/python3.8/dist-packages/keras/src/engine/functional.py:639: UserWarning: Input dict contained keys ['left_page_id', 'left_site', 'right_page_id', 'right_site'] which did not match any model input. They will be ignored by the model.
I1123 10:54:20.135737 14 main.go:139] inputs = self._flatten_to_reference_inputs(inputs)
I1123 10:54:20.135760 14 main.go:139] WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0300s vs `on_train_batch_end` time: 0.0376s). Check your callbacks.
I1123 10:54:23.616630 14 main.go:139] 2023-11-23T10:54:23Z INFO epoch 1:
I1123 10:55:22.620524 14 main.go:139] 2023-11-23T10:55:22Z INFO categorical_accuracy=0.8045838475227356
I1123 10:55:22.620544 14 main.go:139] 2023-11-23T10:55:22Z INFO same_precision=0.6760939359664917
I1123 10:55:22.620563 14 main.go:139] 2023-11-23T10:55:22Z INFO same_recall=0.727324903011322
I1123 10:55:22.620576 14 main.go:139] 2023-11-23T10:55:22Z INFO val_loss=0.5009434223175049
I1123 10:55:22.620695 14 main.go:139] 2023-11-23T10:55:22Z INFO val_categorical_accuracy=0.8238841891288757
I1123 10:55:22.620801 14 main.go:139] 2023-11-23T10:55:22Z INFO val_same_precision=0.607390284538269
I1123 10:55:22.620865 14 main.go:139] 2023-11-23T10:55:22Z INFO val_same_recall=0.7758111953735352
I1123 10:55:24.845616 14 main.go:139] 2023-11-23T10:55:24Z INFO epoch 27:
I1123 10:55:24.845646 14 main.go:139] 2023-11-23T10:55:24Z INFO loss=0.519189178943634
I1123 10:55:24.845659 14 main.go:139] 2023-11-23T10:55:24Z INFO categorical_accuracy=0.8094089031219482
I1123 10:55:24.845672 14 main.go:139] 2023-11-23T10:55:24Z INFO same_precision=0.6472785472869873
I1123 10:55:24.845743 14 main.go:139] 2023-11-23T10:55:24Z INFO same_recall=0.7418960332870483
I1123 10:55:24.845861 14 main.go:139] 2023-11-23T10:55:24Z INFO val_loss=0.49683240056037903
I1123 10:55:24.845900 14 main.go:139] 2023-11-23T10:55:24Z INFO val_categorical_accuracy=0.8262967467308044
I1123 10:55:24.845914 14 main.go:139] 2023-11-23T10:55:24Z INFO val_same_precision=0.6697459816932678
I1123 10:55:24.846021 14 main.go:139] 2023-11-23T10:55:24Z INFO val_same_recall=0.7651715278625488
I1123 10:55:27.488730 14 main.go:139] 2023-11-23T10:55:27Z INFO epoch 28:
I1123 10:55:27.488755 14 main.go:139] 2023-11-23T10:55:27Z INFO loss=0.517208993434906
I1123 10:55:27.488795 14 main.go:139] 2023-11-23T10:55:27Z INFO categorical_accuracy=0.8075995445251465
I1123 10:55:27.488806 14 main.go:139] 2023-11-23T10:55:27Z INFO same_precision=0.6600853800773621
I1123 10:55:27.488857 14 main.go:139] 2023-11-23T10:55:27Z INFO same_recall=0.7438364624977112
I1123 10:55:27.488940 14 main.go:139] 2023-11-23T10:55:27Z INFO val_loss=0.49721571803092957
I1123 10:55:27.489047 14 main.go:139] 2023-11-23T10:55:27Z INFO val_categorical_accuracy=0.8250904679298401
I1123 10:55:27.489056 14 main.go:139] 2023-11-23T10:55:27Z INFO val_same_precision=0.6697459816932678
I1123 10:55:27.489163 14 main.go:139] 2023-11-23T10:55:27Z INFO val_same_recall=0.7591623067855835
I1123 10:55:29.691004 14 main.go:139] 2023-11-23T10:55:29Z INFO epoch 29:
I1123 10:55:29.691056 14 main.go:139] 2023-11-23T10:55:29Z INFO loss=0.5149072408676147
I1123 10:55:29.691077 14 main.go:139] 2023-11-23T10:55:29Z INFO categorical_accuracy=0.8060916662216187
I1123 10:55:29.691171 14 main.go:139] 2023-11-23T10:55:29Z INFO same_precision=0.6648879647254944
I1123 10:55:29.691262 14 main.go:139] 2023-11-23T10:55:29Z INFO same_recall=0.7316500544548035
I1123 10:55:29.691357 14 main.go:139] 2023-11-23T10:55:29Z INFO val_loss=0.49219810962677
Hmm, could you try to rerun a Katib experiment and provide logs using --all-containers while the pod is up and in an error state? IIRC, once the Katib experiment has completed running, the trial's pod is deleted, so you can no longer see the pod in your namespace.
Do I understand correctly that the trial pod is a separate pod, and not the worker pod that I built and that reports these metrics?
I1123 10:54:23.616630 14 main.go:139] 2023-11-23T10:54:23Z INFO epoch 1:
I1123 10:54:23.616663 14 main.go:139] 2023-11-23T10:54:23Z INFO loss=1.7035995721817017
I1123 10:54:23.616686 14 main.go:139] 2023-11-23T10:54:23Z INFO categorical_accuracy=0.5815742015838623
I1123 10:54:23.616696 14 main.go:139] 2023-11-23T10:54:23Z INFO same_precision=0.46905016899108887
I1123 10:54:23.616722 14 main.go:139] 2023-11-23T10:54:23Z INFO same_recall=0.40713292360305786
I1123 10:54:23.616879 14 main.go:139] 2023-11-23T10:54:23Z INFO val_loss=99.96876525878906
I1123 10:54:23.616904 14 main.go:139] 2023-11-23T10:54:23Z INFO val_categorical_accuracy=0.6779252290725708
I1123 10:54:23.617044 14 main.go:139] 2023-11-23T10:54:23Z INFO val_same_precision=0.18244802951812744
I1123 10:54:23.617147 14 main.go:139] 2023-11-23T10:54:23Z INFO val_same_recall=0.585185170173645
Because while the experiment is running there are only these pods:
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d15h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d16h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d16h ago) 7d23h
resell-lab-0 2/2 Running 0 20h
exp-resell-dd-grid-6b5bc56d94-tdt8k 1/1 Running 0 27m
exp-resell-dd-qp9fnnzt-worker-0 2/2 Running 0 2m12s
exp-resell-dd-dmnkgt8f-worker-0 1/2 NotReady 0 2m14s
There are also trials like this:
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs trial.kubeflow.org/exp-resell-dd-qp9fnnzt
error: no kind "Trial" is registered for version "kubeflow.org/v1beta1" in scheme "pkg/scheme/scheme.go:28"
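That error is expected from kubectl logs, which only works on pods; to inspect the Trial custom resource itself, something like the following should work (a sketch using the trial name from above):
kubectl -n my-namespace get trials.kubeflow.org
kubectl -n my-namespace describe trials.kubeflow.org exp-resell-dd-qp9fnnzt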
@Daard I'll get back to your previous comment soon. In the meantime, could you post the output of the juju status katib-db command?
juju status katib-db
Model Controller Cloud/Region Version SLA Timestamp
kubeflow my-controller myk8s/localhost 3.1.6 unsupported 13:10:12Z
App Version Status Scale Charm Channel Rev Address Exposed Message
katib-db 8.0.34-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 99 10.152.183.35 no
Unit Workload Agent Address Ports Message
katib-db/0* active idle 10.1.183.72 Primary
Thank you for your effort in debugging this @Daard. The trial pods are pods spun up by the experiment while it's running. IIUC, in your case the experiment pod is exp-resell-dd-grid-6b5bc56d94-tdt8k and the trial pods are:
exp-resell-dd-qp9fnnzt-worker-0 2/2 Running 0 2m12s
exp-resell-dd-dmnkgt8f-worker-0 1/2 NotReady 0 2m14s
Are those completing successfully or going into Error status? We want the logs from those once they are in the Error state.
I will restart the experiment soon, but it needs 20-30 minutes to start my workers. Is that normal behaviour, by the way? The TFJob starts almost instantly.
After the trial fails my workers are gone, and in the UI I see this message:
Failed to find logs for this Trial.
Make sure you've set "spec.trialTemplate.retain" field to "true" in the Experiment definition.
If this error persists then the Pod's logs are not currently persisted in the cluster.
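For context, retaining trial pods is configured in the Experiment definition itself; a minimal sketch of the relevant fragment (experiment name and the rest of the trialTemplate are placeholders) might look like:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: <experiment-name>
  namespace: my-namespace
spec:
  trialTemplate:
    retain: true    # keep trial pods after completion/failure so their logs stay accessible
    # ... rest of the trialTemplate unchanged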
The yaml output says this:
- type: Failed
status: 'True'
reason: 'TrialFailed. Job reason: TFJobFailed'
message: >-
Trial has failed. Job message: TFJob my-namespace/exp-resell-dd-dmnkgt8f
has failed because 1 Worker replica(s) failed.
lastUpdateTime: '2023-11-23T11:02:53Z'
lastTransitionTime: '2023-11-23T11:02:53Z'
I will try to increase the replica count. Maybe that will help to get error logs.
I have got some logs:
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d19h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d19h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d19h ago) 8d
exp-resell-ee-grid-667cfd4548-btbnh 1/1 Running 0 24m
exp-resell-ee-pdrnc575-worker-1 0/1 Completed 0 3m3s
exp-resell-ee-2pxhvm8q-worker-1 1/1 Running 0 2m59s
exp-resell-ee-pdrnc575-worker-0 0/2 Error 2 (28s ago) 3m4s
exp-resell-ee-2pxhvm8q-worker-0 0/2 Error 3 (30s ago) 3m1s
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs exp-resell-ee-pdrnc575-worker-0
Defaulted container "tensorflow" out of: tensorflow, metrics-logger-and-collector
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d19h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d19h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d19h ago) 8d
exp-resell-ee-grid-667cfd4548-btbnh 1/1 Running 0 25m
exp-resell-ee-pdrnc575-worker-1 0/1 Completed 0 3m47s
exp-resell-ee-2pxhvm8q-worker-1 1/1 Running 0 3m43s
exp-resell-ee-pdrnc575-worker-0 0/2 CrashLoopBackOff 3 (30s ago) 3m48s
exp-resell-ee-2pxhvm8q-worker-0 0/2 Error 4 (48s ago) 3m45s
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace logs exp-resell-ee-pdrnc575-worker-0
Defaulted container "tensorflow" out of: tensorflow, metrics-logger-and-collector
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d19h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d19h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d19h ago) 8d
exp-resell-ee-grid-667cfd4548-btbnh 1/1 Running 0 28m
exp-resell-ee-pdrnc575-worker-1 0/1 Completed 0 7m36s
exp-resell-ee-2pxhvm8q-worker-1 0/1 Completed 0 7m32s
exp-resell-ee-2pxhvm8q-worker-0 0/2 CrashLoopBackOff 5 (2m25s ago) 7m34s
exp-resell-ee-pdrnc575-worker-0 0/2 CrashLoopBackOff 5 (2m16s ago) 7m37s
After I increased the replica count, some trials completed. Their output is similar to the TFJob log, but in the UI I got this output at the end:
I1123 14:38:53.968930 5790 main.go:139] 2023-11-23T14:32:41Z INFO epoch 31:
I1123 14:38:53.968937 5790 main.go:139] 2023-11-23T14:32:41Z INFO loss=0.5217770934104919
I1123 14:38:53.968944 5790 main.go:139] 2023-11-23T14:32:41Z INFO categorical_accuracy=0.8083534240722656
I1123 14:38:53.968949 5790 main.go:139] 2023-11-23T14:32:41Z INFO same_precision=0.6734258532524109
I1123 14:38:53.968956 5790 main.go:139] 2023-11-23T14:32:41Z INFO same_recall=0.7432273030281067
I1123 14:38:53.968963 5790 main.go:139] 2023-11-23T14:32:41Z INFO val_loss=0.5091827511787415
I1123 14:38:53.968975 5790 main.go:139] 2023-11-23T14:32:41Z INFO val_categorical_accuracy=0.8226779103279114
I1123 14:38:53.968989 5790 main.go:139] 2023-11-23T14:32:41Z INFO val_same_precision=0.6974595785140991
F1123 14:38:53.968994 5790 main.go:421] Failed to wait for worker container: unable to find main pid from the process list [1 5790]
I1123 14:38:53.969127 5790 main.go:139] 2023-11-23T14:32:41Z INFO val_same_recall=0.7512437701225281
I1123 14:38:53.969137 5790 main.go:139] 2023-11-23T14:32:43Z INFO epoch 32:
Now the experiment is stuck.
What we need are the logs from pods that are in Error state using --all-containers, i.e. running kubectl -n my-namespace logs pod-name --all-containers.
I did not catch the Error state of the worker, only CrashLoopBackOff. Is that crucial, or can you work with these logs?
(base) larion@flairmonster1:~/LUN/dockers/resell-trainer$ kubectl -n my-namespace get pods
NAME READY STATUS RESTARTS AGE
ml-pipeline-ui-artifact-c4969b95b-6bj86 2/2 Running 16 (5d20h ago) 16d
ml-pipeline-visualizationserver-677c86b748-gqw4k 2/2 Running 12 (5d21h ago) 16d
tboard-77d56648ff-dqfz8 2/2 Running 6 (5d21h ago) 8d
exp-resell-ff-grid-77955875cb-sl7j4 1/1 Running 0 40m
resell-lab-0 2/2 Running 0 16m
exp-resell-ff-5crjfw69-worker-1 0/1 Pending 0 10m
exp-resell-ff-cc4mjp6j-worker-1 0/1 Completed 0 10m
exp-resell-ff-cc4mjp6j-worker-0 0/2 CrashLoopBackOff 6 (2m52s ago) 11m
exp-resell-ff-5crjfw69-worker-0 0/2 CrashLoopBackOff 6 (22s ago) 10m
kubectl -n my-namespace logs exp-resell-ff-cc4mjp6j-worker-0 --all-containers > error.logs
error.txt
@orfeas-k Hello again. I am not familiar with k8s and Kubeflow, so I did not realise that adding the --all-containers arg would make a difference. I have added all logs from the error containers in the err2.txt file. I hope it will help to debug Katib.
I have faced a similar issue with this example. The trial pods are now stuck in Pending state. I hope you will find the problem; if you need any additional logs I can easily send them.
@Daard could you try removing the katib-db charm and redeploying it with the following commands? Let's see if this will unblock new experiments. Please note that this will delete any data related to Katib experiments you've already run, so avoid doing this if you have important data from past experiments.
juju remove-application katib-db
# Wait for it to be removed
juju deploy mysql-k8s katib-db --channel 8.0/stable --trust --constraints mem=2G
juju relate katib-db-manager:relational-db katib-db:database
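After redeploying, a hedged way to confirm things have settled is to wait until both applications report active/idle, e.g.:
juju status katib-db katib-db-manager
# or keep polling until both units go active/idle:
watch -n 5 juju status katib-db katib-db-manager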
@orfeas-k Is this normal? juju_status.txt
The logs from katib-db-manager:
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken data = self._lazy_data = self._load()
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 1378, in _load
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken return self._backend.relation_get(self.relation.id, self._entity.name, self._is_app)
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 2697, in relation_get
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken raw_data_content = self._run(*args, return_output=True, use_json=True)
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken File "/var/lib/juju/agents/unit-katib-db-manager-0/charm/venv/ops/model.py", line 2618, in _run
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken raise ModelError(e.stderr)
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken ops.model.ModelError: ERROR permission denied
2023-11-24T13:05:31.562Z [container-agent] 2023-11-24 13:05:31 WARNING relational-db-relation-broken
2023-11-24T13:05:31.810Z [container-agent] 2023-11-24 13:05:31 ERROR juju.worker.uniter.operation runhook.go:180 hook "relational-db-relation-broken" (via hook dispatching script: dispatch) failed: exit status 1
2023-11-24T13:05:31.812Z [container-agent] 2023-11-24 13:05:31 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:05:37.228Z [container-agent] 2023-11-24 13:05:37 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:05:37.228Z [container-agent] 2023-11-24 13:05:37 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:05:37.299Z [container-agent] 2023-11-24 13:05:37 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
2023-11-24T13:09:27.243Z [container-agent] 2023-11-24 13:09:27 INFO juju.worker.uniter resolver.go:161 awaiting error resolution for "relation-broken" hook
Filed an issue in mysql-k8s: https://github.com/canonical/mysql-k8s-operator/issues/341
Received a response (https://github.com/canonical/mysql-k8s-operator/issues/341#issuecomment-1836519425) mentioning that mysql-k8s has an issue scaling up after it was scaled to 0 units, and that they're working on a solution. If the root cause of the Katib issue is katib-db not being responsive to katib-db-manager, then it will be resolved once they push a fix.
In the end, this appears to be the same issue we hit in https://github.com/canonical/bundle-kubeflow/issues/893, described in detail in https://github.com/canonical/bundle-kubeflow/issues/893#issuecomment-2142022677.
Thank you for reporting us your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5873.
This message was autogenerated
Hit this issue while defining tests for CKF 1.8/stable in an air-gapped environment (https://github.com/canonical/bundle-kubeflow/issues/918); PR #192 should resolve it.
Steps to reproduce
Result: pods go to Error at the final stage of reporting results, with the following logs:
10.152.183.167 is the IP of the katib-db-manager ClusterIP Service. The same logs can be viewed from the katib-controller container logs.
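A hedged way to follow up on this is to confirm the Service the metrics collector is dialing and pull the controller-side logs (the pod name assumes the charm's usual <app>-0 convention, as with the other Katib pods above):
# Confirm the katib-db-manager ClusterIP and exposed port
kubectl -n kubeflow get svc katib-db-manager
# View the same reporting errors from the katib-controller side
kubectl -n kubeflow logs katib-controller-0 --all-containers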