orfeas-k closed this issue 3 weeks ago.
Thank you for reporting your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5649.
This message was autogenerated
Deployed a 1.28 AKS cluster and Juju 3.5 with the same configuration as our CI and ran the Katib UAT. I could not find really useful logs that indicate the issue:
╰─$ kl -n test-kubeflow cmaes-example-cmaes-589674bc6f-phff2 --all-containers -f
I0529 12:05:25.927601 1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0529 12:05:56.169166 1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr" value:"0.04188612100654" name:"momentum" value:"0.7043612817216396" ]
I0529 12:05:56.169233 1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr" value:"0.04511033252270099" name:"momentum" value:"0.6980954001565728" ]
╰─$ kl -n test-kubeflow cmaes-example-hdnn4bl5-jmfrx --all-containers -f
I0529 12:05:57.870325 14 main.go:396] Trial Name: cmaes-example-hdnn4bl5
100.0%12:06:00.184060 14 main.go:139]
100.0%12:06:00.689822 14 main.go:139]
100.0%12:06:00.885008 14 main.go:139]
100.0%12:06:01.057909 14 main.go:139]
I0529 12:06:01.185685 14 main.go:139] 2024-05-29T12:06:01Z INFO Train Epoch: 1 [0/60000 (0%)] loss=2.2980
I0529 12:06:01.606906 14 main.go:139] 2024-05-29T12:06:01Z INFO Train Epoch: 1 [640/60000 (1%)] loss=2.2849
I0529 12:06:01.932049 14 main.go:139] 2024-05-29T12:06:01Z INFO Train Epoch: 1 [1280/60000 (2%)] loss=2.0193
... # more training
I0529 12:06:30.235188 14 main.go:139] 2024-05-29T12:06:30Z INFO Train Epoch: 1 [59520/60000 (99%)] loss=0.3112
I0529 12:06:32.830804 14 main.go:139] 2024-05-29T12:06:32Z INFO {metricName: accuracy, metricValue: 0.8480};{metricName: loss, metricValue: 0.4272}
I0529 12:06:32.830912 14 main.go:139]
I0529 12:06:32.830931 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
I0529 12:06:32.830936 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
I0529 12:06:32.830974 14 main.go:139] Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.830990 14 main.go:139]
I0529 12:06:32.831016 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
I0529 12:06:32.831025 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
I0529 12:06:32.831032 14 main.go:139] Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831036 14 main.go:139]
I0529 12:06:32.831044 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
I0529 12:06:32.831047 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
I0529 12:06:32.831054 14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831058 14 main.go:139]
I0529 12:06:32.831065 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
I0529 12:06:32.831069 14 main.go:139] Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
I0529 12:06:32.831077 14 main.go:139] Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
I0529 12:06:32.831081 14 main.go:139]
F0529 12:06:54.922852 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.0.35.173:65535: i/o timeout"
The only log that points somewhere is the last line, where it cannot report logs and fails to dial the katib-db-manager (10.0.35.173 is the IP of its k8s service). Note that it dials port 65535 instead of the expected 6789, but probably this is a default gRPC port (?) since this is not an issue in different k8s deployments. This could also point to katib-db or katib-db-manager not functioning.
In the katib-db logs, there were some "readiness" failures:
2024-05-29T09:51:40.163Z [pebble] Check "readiness" failure 1 (threshold 3): received non-20x status code 418
After talking to the DP team though, this looks like a Pebble issue, and since the failures are gone after the mysql-pebble-ready event, the cluster should be healthy. This was also confirmed by running juju run katib-db/0 get-cluster-status.
That being said, as in #108, most of the upstream issues mention that this is related to issues in katib-db or katib-db-manager (see https://github.com/kubeflow/katib/issues/1537 and https://github.com/kubeflow/katib/issues/1795).
We ran the UAT on MicroK8s 1.28 (Juju 3.5.0) and it failed with the same error message.
At some point, without any juju config change, the trials stopped hitting port 65535 and could report metrics: I had killed the katib-controller pod to clear its logs and, without doing any further configuration, the UAT started succeeding from that point on. Not sure if killing the pod had something to do with it, but this could show that there is a race happening, resulting in the UAT failing.
It turns out that indeed the training container succeeds while the metrics-collector component fails to report metrics, and thus the trial fails. The reason it is failing is indeed the last line of the logs:
F0529 12:06:54.922852 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.0.35.173:65535: i/o timeout"
In this line, we see that it is trying to hit port 65535 instead of the default 6789. This was also confirmed by comparing the trial pod YAML, and more specifically the args passed to the metrics container, between cases when the trial was successful and when it wasn't:
- args:
- -t
- cmaes-example-fqgkd7hd
- -m
- loss;Train-accuracy
- -o-type
- minimize
- -s-db
- - katib-db-manager.kubeflow:6789
+ - katib-db-manager.kubeflow:65535
- -path
- /var/log/katib/metrics.log
- -format
- TEXT
From the above, it looks like when the trial is created, those args are applied to the metrics collector container of the pod. Looking at the upstream code, the db-manager address' port comes from the KATIB_DB_MANAGER_SERVICE_PORT env variable, or defaults to 6789 when there is no such variable. Now, what does not make sense is that nowhere in the upstream Katib manifests or in the katib-operators repository is a KATIB_DB_MANAGER_SERVICE_PORT env variable set.
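For reference, here is a minimal Go sketch of that lookup, using illustrative names rather than the exact upstream identifiers: prefer KATIB_DB_MANAGER_SERVICE_PORT when it is set, otherwise fall back to 6789.

package main

import (
	"fmt"
	"os"
)

// dbManagerAddr is an illustrative helper (not the exact upstream function):
// it uses KATIB_DB_MANAGER_SERVICE_PORT when set and falls back to 6789.
func dbManagerAddr(host string) string {
	port := os.Getenv("KATIB_DB_MANAGER_SERVICE_PORT")
	if port == "" {
		port = "6789" // documented Katib default
	}
	return fmt.Sprintf("%s:%s", host, port)
}

func main() {
	// With KATIB_DB_MANAGER_SERVICE_PORT=65535 injected into the container,
	// this yields "katib-db-manager.kubeflow:65535" instead of ":6789".
	fmt.Println(dbManagerAddr("katib-db-manager.kubeflow"))
}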
SSHing into the charms, it turns out that there are multiple katib-related ENV variables set:
╰─$ juju ssh katib-db-manager/0
# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=65535
KATIB_DB_MANAGER_SERVICE_PORT_PLACEHOLDER=65535
KATIB_DB_MANAGER_SERVICE_HOST=10.0.214.142
# exit
╰─$ juju ssh katib-controller/0
# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=65535
KATIB_DB_MANAGER_SERVICE_PORT_PLACEHOLDER=65535
KATIB_DB_MANAGER_SERVICE_HOST=10.0.214.142
which explains why the trials' metric-collector container is misconfigured.
Regarding the previous comment, I killed the katib-manager-0 pod, and when it was spun up again, those variables were different, which explains why the UAT succeeded yesterday after killing the pod:
# printenv | grep -i katib_db_manager_service
KATIB_DB_MANAGER_SERVICE_PORT=6789
KATIB_DB_MANAGER_SERVICE_PORT_API=6789
KATIB_DB_MANAGER_SERVICE_HOST=10.152.183.120
Copying from the kubernetes_service_patch (KSP) library:
When sidecar charms are deployed, Juju creates a service named after the application in the namespace. This service by default contains a "placeholder" port, which is 65535/TCP.
This ties with the kubernetes native behavior of having container environment variables for every service present on the cluster:
A list of all services that were running when a Container was created is available to that Container as environment variables. This list is limited to services within the same namespace as the new Container's Pod and Kubernetes control plane services.
Thus, when a charm with a pebble service called test-charm is deployed, the following ENV variables are expected to be found in other charms' containers that are created after test-charm:
TEST_CHARM_SERVICE_PORT=<port>
TEST_CHARM_SERVICE_HOST=<test-charm-service-ip>
Note that those ENV variables are not updated after the container has been created.
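As a side note, here is a small Go sketch of how those injected variable names are derived from the Service name (uppercase, dashes replaced with underscores), which is why a service called katib-db-manager produces KATIB_DB_MANAGER_SERVICE_HOST and KATIB_DB_MANAGER_SERVICE_PORT:

package main

import (
	"fmt"
	"strings"
)

// serviceEnvPrefix mirrors how Kubernetes derives the injected env variable
// prefix from a Service name: uppercase it and replace dashes with underscores.
func serviceEnvPrefix(serviceName string) string {
	return strings.ToUpper(strings.ReplaceAll(serviceName, "-", "_"))
}

func main() {
	p := serviceEnvPrefix("katib-db-manager")
	// These are baked into containers created after the Service, with whatever
	// port the Service had at container-creation time (65535 if not yet patched).
	fmt.Println(p + "_SERVICE_HOST")
	fmt.Println(p + "_SERVICE_PORT")
}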
Copying and rephrasing from two comments above regarding KATIB_DB_MANAGER_SERVICE_PORT:
- When a trial is created, the args are applied to the metrics collector container of the pod.
- To set the address' port in those args, KATIB_DB_MANAGER_SERVICE_PORT is read; if not present, it defaults to 6789.
We can now see how the above two work in combination with each other. The result is a race where:
1. The katib-db-manager service is created but not patched yet (thus the service has port 65535).
2. The katib-controller container is created with the ENV var KATIB_DB_MANAGER_SERVICE_PORT=65535.
3. The katib-db-manager service is patched, but this information is not propagated to containers that were already created (thus not to the katib-controller containers either).
4. Trials' metrics collectors are therefore configured to report to port 65535. This results in:
F0529 12:06:54.922852 14 main.go:453] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.0.35.173:65535: i/o timeout"
The weird thing is that this is intermittent: sometimes this ENV variable has the proper value and sometimes it does not.
Until we figure out more about this behaviour and maybe tackle it in a better way, we'll explicitly set KATIB_DB_MANAGER_SERVICE_PORT. This information will be broadcast to katib-controller through a new relation between the charm and katib-db-manager. For this, the k8s_service_info library will be used.
Bug Description
This UAT fails for bundle latest/edge in our CI with AssertionError: Katib Experiment was not successful. Note that this is successful for bundle 1.8/stable. Unfortunately, we cannot have detailed logs since that's a known limitation of how our UATs run.
Example runs: first-k8s-1.28, second-k8s-1.28, k8s-1.26
To Reproduce
Rerun the CI from #892 or just from main for "latest/edge".
Environment
AKS 1.26 and 1.28, Juju 3.1, CKF latest/edge
Relevant Log Output
Additional Context
No response