orfeas-k closed this issue 3 weeks ago.
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5649.
This message was autogenerated
Deployed a 1.28 AKS cluster and Juju 3.5 with the same configuration as our CI and ran the Katib UAT. I could not find really useful logs that indicate the issue:
╰─$ kl -n test-kubeflow cmaes-example-cmaes-589674bc6f-phff2 --all-containers -f
I0529 12:05:25.927601 1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0529 12:05:56.169166 1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr" value:"0.04188612100654" name:"momentum" value:"0.7043612817216396" ]
I0529 12:05:56.169233 1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr" value:"0.04511033252270099" name:"momentum" value:"0.6980954001565728" ]
╰─$ kl -n test-kubeflow cmaes-example-hdnn4bl5-jmfrx --all-containers -f
I0529 12:05:57.870325 14 main.go:396] Trial Name: cmaes-example-hdnn4bl5
100.0%12:06:00.184060 14 main.go:139]
100.0%12:06:00.689822 14 main.go:139]
100.0%12:06:00.885008 14 main.go:139]
100.0%12:06:01.057909 14 main.go:139]
I0529 12:06:01.185685 14 main.go:139] 2024-05-29T12:06:01Z INFO Train Epoch: 1 [0/60000 (0%)] loss=2.2980
I0529 12:06:01.606906 14 main.go:139] 2024-05-29T12:06:01Z INFO Train Epoch: 1 [640/60000 (1%)] loss=2.2849
I0529 12:06:01.932049 14 main.go:139] 2024-05-29T12:06:01Z INFO Train Epoch: 1 [1280/60000 (2%)] loss=2.0193
... # more training
I0529 12:06:30.235188 14 main.go:139] 2024-05-29T12:06:30Z INFO Train Epoch: 1 [59520/60000 (99%)] loss=0.3112
I0529 12:06:32.830804 14 main.go:139] 2024-05-29T12:06:32Z INFO {metricName: accuracy, metricValue: 0.8480};{metricName: loss, metricValue: 0.4272}
I0529 12:06:32.830912 14 main.go:139]
I0529 12:06:32.830931 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
I0529 12:06:32.830936 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
I0529 12:06:32.830974 14 main.go:139] Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.830990 14 main.go:139]
I0529 12:06:32.831016 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
I0529 12:06:32.831025 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
I0529 12:06:32.831032 14 main.go:139] Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831036 14 main.go:139]
I0529 12:06:32.831044 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
I0529 12:06:32.831047 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
I0529 12:06:32.831054 14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831058 14 main.go:139]
I0529 12:06:32.831065 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
I0529 12:06:32.831069 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
I0529 12:06:32.831077 14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831081 14 main.go:139]
F0529 12:06:54.922852 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.0.35.173:65535: i/o timeout"
The only log that points somewhere is the last line, where it cannot report logs and fails to dial the katib-db-manager (10.0.35.173 is the IP of its k8s service). Note that it dials port 65535 instead of the expected 6789, but probably this is a default gRPC port (?) since this is not an issue in different k8s deployments. This could also point to katib-db or katib-db-manager not functioning.
In the katib-db logs, there were some "readiness" failures:
2024-05-29T09:51:40.163Z [pebble] Check "readiness" failure 1 (threshold 3): received non-20x status code 418
After talking to the DP team though, this looks like a Pebble issue, and since the failures are gone after the mysql-pebble-ready event, the cluster should be healthy. This was also confirmed by running juju run katib-db/0 get-cluster-status.
That being said, as in #108, most of the upstream issues mention that this is related to issues in katib-db or katib-db-manager (see https://github.com/kubeflow/katib/issues/1537 and https://github.com/kubeflow/katib/issues/1795).
We ran the UAT on MicroK8s 1.28 (Juju 3.5.0) and it failed with the same error message.
At some point, without any juju config change, the trials stopped hitting port 65535 and could report metrics: I had killed the katib-controller pod to clear its logs and, without doing any further configuration, the UAT started succeeding from that point on. Not sure if killing the pod had something to do with it, but this could show that there is a race happening, resulting in the UAT failing.
It turns out that indeed the training container succeeds while the metrics-collector component fails to report metrics, and thus the trial fails. The reason it is failing is indeed the last line of the logs:
F0529 12:06:54.922852 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.0.35.173:65535: i/o timeout"
In this line, we see that it is trying to hit port 65535 instead of the default 6789. This was also confirmed by comparing the trial pod YAML, and more specifically the args passed to the metrics container, between cases when the trial was successful and when it wasn't:
- args:
- -t
- cmaes-example-fqgkd7hd
- -m
- loss;Train-accuracy
- -o-type
- minimize
- -s-db
- - katib-db-manager.kubeflow:6789
+ - katib-db-manager.kubeflow:65535
- -path
- /var/log/katib/metrics.log
- -format
- TEXT
From the above, it looks like when the trial is created, those args are applied to the metrics collector container of the pod. Looking at the upstream code, the db-manager address' port comes from the KATIB_DB_MANAGER_SERVICE_PORT env variable, or defaults to 6789 when there is no such variable. Now, what does not make sense is that nowhere in the upstream Katib manifests or in the katib-operators repository is a KATIB_DB_MANAGER_SERVICE_PORT env variable set.
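For reference, here is a minimal Go sketch of that lookup, using illustrative names rather than the exact upstream identifiers: prefer KATIB_DB_MANAGER_SERVICE_PORT when it is set, otherwise fall back to 6789.

package main

import (
	"fmt"
	"os"
)

// dbManagerAddr is an illustrative helper (not the exact upstream function):
// it uses KATIB_DB_MANAGER_SERVICE_PORT when set and falls back to 6789.
func dbManagerAddr(host string) string {
	port := os.Getenv("KATIB_DB_MANAGER_SERVICE_PORT")
	if port == "" {
		port = "6789" // documented Katib default
	}
	return fmt.Sprintf("%s:%s", host, port)
}

func main() {
	// With KATIB_DB_MANAGER_SERVICE_PORT=65535 injected into the container,
	// this yields "katib-db-manager.kubeflow:65535" instead of ":6789".
	fmt.Println(dbManagerAddr("katib-db-manager.kubeflow"))
}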
SSHing into the charms, it turns out that there are multiple katib-related ENV variables set:
╰─$ juju ssh katib-db-manager/0
# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=65535
KATIB_DB_MANAGER_SERVICE_PORT_PLACEHOLDER=65535
KATIB_DB_MANAGER_SERVICE_HOST=10.0.214.142
# exit
╰─$ juju ssh katib-controller/0
# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=65535
KATIB_DB_MANAGER_SERVICE_PORT_PLACEHOLDER=65535
KATIB_DB_MANAGER_SERVICE_HOST=10.0.214.142
which explains why the trials' metric-collector container is misconfigured.
Regarding the previous comment, I killed the katib-manager-0 pod, and when it was spun up again, those variables were different, which explains why the UAT succeeded yesterday after killing the pod:
# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=6789
KATIB_DB_MANAGER_SERVICE_PORT_API=6789
KATIB_DB_MANAGER_SERVICE_HOST=10.152.183.120
Copying from the kubernetes_service_patch (KSP) library:
When sidecar charms are deployed, Juju creates a service named after the application in the namespace. This service by default contains a "placeholder" port, which is 65535/TCP.
This ties with the kubernetes native behavior of having container environment variables for every service present on the cluster:
A list of all services that were running when a Container was created is available to that Container as environment variables. This list is limited to services within the same namespace as the new Container's Pod and Kubernetes control plane services.
Thus, when a charm with a pebble service called test-charm is deployed, the following ENV variables are expected to be found in other charms' containers that are created after test-charm:
TEST_CHARM_SERVICE_PORT=<port>
TEST_CHARM_SERVICE_HOST=<test-charm-service-ip>
Note that those ENV variables are not updated after the container has been created.
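As a side note, here is a small Go sketch of how those injected variable names are derived from the Service name (uppercase, dashes replaced with underscores), which is why a service called katib-db-manager produces KATIB_DB_MANAGER_SERVICE_HOST and KATIB_DB_MANAGER_SERVICE_PORT:

package main

import (
	"fmt"
	"strings"
)

// serviceEnvPrefix mirrors how Kubernetes derives the injected env variable
// prefix from a Service name: uppercase it and replace dashes with underscores.
func serviceEnvPrefix(serviceName string) string {
	return strings.ToUpper(strings.ReplaceAll(serviceName, "-", "_"))
}

func main() {
	p := serviceEnvPrefix("katib-db-manager")
	// These are baked into containers created after the Service, with whatever
	// port the Service had at container-creation time (65535 if not yet patched).
	fmt.Println(p + "_SERVICE_HOST")
	fmt.Println(p + "_SERVICE_PORT")
}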
Copying and rephrasing from two comments above regarding KATIB_DB_MANAGER_SERVICE_PORT:
- When a trial is created, the args are applied to the metrics collector container of the pod.
- To set the address' port in those args, KATIB_DB_MANAGER_SERVICE_PORT is read; if not present, it defaults to 6789.
We can now see how the above two work in combination with each other. The result is a race where:
1. The katib-db-manager service is created but not patched yet (thus the service has port 65535).
2. The katib-controller container is created with the ENV var KATIB_DB_MANAGER_SERVICE_PORT=65535.
3. The katib-db-manager service is patched, but this information is not propagated to containers that were already created (thus not to the katib-controller containers either).
4. Trials' metrics collectors are therefore configured to report to port 65535. This results in:
F0529 12:06:54.922852 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.0.35.173:65535: i/o timeout"
The weird thing is that this is intermittent: sometimes this ENV variable has the proper value and sometimes it does not.
Until we figure out more about this behaviour and maybe tackle it in a better way, we'll explicitly set KATIB_DB_MANAGER_SERVICE_PORT. This information will be broadcast to katib-controller through a new relation between the charm and katib-db-manager. For this, the k8s_service_info library will be used.
Bug Description
This UAT fails for bundle latest/edge in our CI with AssertionError: Katib Experiment was not successful. Note that this is successful for bundle 1.8/stable. Unfortunately, we cannot have detailed logs since that's a known limitation of how our UATs run.
Example runs: first-k8s-1.28, second-k8s-1.28, k8s-1.26
To Reproduce
Rerun the CI from #892 or just from main for "latest/edge".
Environment
AKS 1.26 and 1.28, Juju 3.1, CKF latest/edge
Relevant Log Output
Additional Context
No response