canonical / bundle-kubeflow

Charmed Kubeflow

ci(aks): Training-operator UAT fails on AKS k8s 1.28 #894

Closed orfeas-k closed 3 months ago

orfeas-k commented 4 months ago

Bug Description

The training-operator UAT starts failing after bumping the k8s version to 1.28 on AKS, with "AssertionError: Job pytorch-dist-mnist-gloo was not successful." This is the case for both CKF latest/edge and 1.8/stable. Unfortunately, we do not have more detailed logs due to a known limitation of how our UATs run: https://github.com/canonical/charmed-kubeflow-uats/issues/4.

Example runs

To Reproduce

Run CI for k8s version 1.28
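
For reference, a rough local reproduction (outside CI) might look like the sketch below, assuming an Azure subscription and Juju 3.1; the resource group and cluster names are placeholders, and the last step points at the UAT repository rather than giving an exact command:

az aks create --resource-group ckf-test --name ckf-aks --kubernetes-version 1.28 --node-count 2 --generate-ssh-keys
az aks get-credentials --resource-group ckf-test --name ckf-aks
juju add-k8s aks --client
juju bootstrap aks aks-controller
juju add-model kubeflow
juju deploy kubeflow --channel=1.8/stable --trust
# wait for all charms to go active, then run the training UAT notebook
# from canonical/charmed-kubeflow-uats against this model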

Environment

AKS k8s 1.28 Juju 3.1

juju status for 1.8:

Model     Controller      Cloud/Region    Version  SLA          Timestamp
kubeflow  aks-controller  aks/westeurope  3.1.8    unsupported  09:19:35Z

App                        Version                  Status  Scale  Charm                    Channel          Rev  Address       Exposed  Message
admission-webhook                                   active      1  admission-webhook        1.8/stable       301  10.0.245.250  no       
argo-controller                                     active      1  argo-controller          3.3.10/stable    424  10.0.249.157  no       
dex-auth                                            active      1  dex-auth                 2.36/stable      422  10.0.185.107  no       
envoy                      res:oci-image@cc06b3e    active      1  envoy                    2.0/stable       101  10.0.244.49   no       
istio-ingressgateway                                active      1  istio-gateway            1.17/stable      723  10.0.216.118  no       
istio-pilot                                         active      1  istio-pilot              1.17/stable      827  10.0.173.92   no       
jupyter-controller                                  active      1  jupyter-controller       1.8/stable       849  10.0.75.253   no       
jupyter-ui                                          active      1  jupyter-ui               1.8/stable       858  10.0.184.139  no       
katib-controller           res:oci-image@b6a6100    active      1  katib-controller         0.16/stable      446  10.0.106.5    no       
katib-db                   8.0.35-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       127  10.0.233.45   no       
katib-db-manager                                    active      1  katib-db-manager         0.16/stable      411  10.0.188.36   no       
katib-ui                                            active      1  katib-ui                 0.16/stable      422  10.0.126.70   no       
kfp-api                                             active      1  kfp-api                  2.0/stable      1035  10.0.86.37    no       
kfp-db                     8.0.35-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/stable       127  10.0.57.119   no       
kfp-metadata-writer                                 active      1  kfp-metadata-writer      2.0/stable       118  10.0.61.100   no       
kfp-persistence                                     active      1  kfp-persistence          2.0/stable      1039  10.0.131.226  no       
kfp-profile-controller                              active      1  kfp-profile-controller   2.0/stable       998  10.0.184.246  no       
kfp-schedwf                                         active      1  kfp-schedwf              2.0/stable      1052  10.0.234.76   no       
kfp-ui                                              active      1  kfp-ui                   2.0/stable      1034  10.0.225.138  no       
kfp-viewer                                          active      1  kfp-viewer               2.0/stable      1064  10.0.229.253  no       
kfp-viz                                             active      1  kfp-viz                  2.0/stable       985  10.0.134.29   no       
knative-eventing                                    active      1  knative-eventing         1.10/stable      353  10.0.44.250   no       
knative-operator                                    active      1  knative-operator         1.10/stable      328  10.0.68.158   no       
knative-serving                                     active      1  knative-serving          1.10/stable      354  10.0.61.216   no       
kserve-controller                                   active      1  kserve-controller        0.11/stable      523  10.0.11.66    no       
kubeflow-dashboard                                  active      1  kubeflow-dashboard       1.8/stable       454  10.0.147.5    no
kubeflow-profiles                                   active      1  kubeflow-profiles        1.8/stable       355  10.0.68.5     no       
kubeflow-roles                                      active      1  kubeflow-roles           1.8/stable       187  10.0.196.222  no       
kubeflow-volumes           res:oci-image@2261827    active      1  kubeflow-volumes         1.8/stable       260  10.0.29.7     no       
metacontroller-operator                             active      1  metacontroller-operator  3.0/stable       252  10.0.66.178   no       
minio                      res:oci-image@1755999    active      1  minio                    ckf-1.8/stable   278  10.0.247.208  no       
mlmd                       res:oci-image@44abc5d    active      1  mlmd                     1.14/stable      127  10.0.219.231  no       
oidc-gatekeeper                                     active      1  oidc-gatekeeper          ckf-1.8/stable   350  10.0.38.12    no       
pvcviewer-operator                                  active      1  pvcviewer-operator       1.8/stable        30  10.0.238.124  no       
seldon-controller-manager                           active      1  seldon-core              1.17/stable      664  10.0.22.127   no       
tensorboard-controller                              active      1  tensorboard-controller   1.8/stable       257  10.0.44.54    no       
tensorboards-web-app                                active      1  tensorboards-web-app     1.8/stable       245  10.0.204.180  no       
training-operator                                   active      1  training-operator        1.7/stable       347  10.0.91.235   no       

Unit                          Workload  Agent  Address      Ports          Message
admission-webhook/0*          active    idle   10.244.0.10                 
argo-controller/0*            active    idle   10.244.1.6                  
dex-auth/0*                   active    idle   10.244.0.12                 
envoy/0*                      active    idle   10.244.1.34  9090,9901/TCP  
istio-ingressgateway/0*       active    idle   10.244.1.7                  
istio-pilot/0*                active    idle   10.244.0.13                 
jupyter-controller/0*         active    idle   10.244.0.14                 
jupyter-ui/0*                 active    idle   10.244.0.15
katib-controller/0*           active    idle   10.244.0.34  443,8080/TCP   
katib-db-manager/0*           active    idle   10.244.1.10                 
katib-db/0*                   active    idle   10.244.0.16                 Primary
katib-ui/0*                   active    idle   10.244.1.11                 
kfp-api/0*                    active    idle   10.244.1.12                 
kfp-db/0*                     active    idle   10.244.1.13                 Primary
kfp-metadata-writer/0*        active    idle   10.244.0.18                 
kfp-persistence/0*            active    idle   10.244.0.20                 
kfp-profile-controller/0*     active    idle   10.244.0.22                 
kfp-schedwf/0*                active    idle   10.244.0.23                 
kfp-ui/0*                     active    idle   10.244.0.24                 
kfp-viewer/0*                 active    idle   10.244.1.15                 
kfp-viz/0*                    active    idle   10.244.0.26                 
knative-eventing/0*           active    idle   10.244.0.17
knative-operator/0*           active    idle   10.244.0.28                 
knative-serving/0*            active    idle   10.244.0.21                 
kserve-controller/0*          active    idle   10.244.1.18
kubeflow-dashboard/0*         active    idle   10.244.0.27                 
kubeflow-profiles/0*          active    idle   10.244.1.19
kubeflow-roles/0*             active    idle   10.244.1.14                 
kubeflow-volumes/0*           active    idle   10.244.1.21  5000/TCP       
metacontroller-operator/0*    active    idle   10.244.0.25                 
minio/0*                      active    idle   10.244.0.35  9000-9001/TCP  
mlmd/0*                       active    idle   10.244.1.35  8080/TCP       
oidc-gatekeeper/0*            active    idle   10.244.1.16                 
pvcviewer-operator/0*         active    idle   10.244.1.20
seldon-controller-manager/0*  active    idle   10.244.1.17                 
tensorboard-controller/0*     active    idle   10.244.0.30                 
tensorboards-web-app/0*       active    idle   10.244.0.31                 
training-operator/0*          active    idle   10.244.0.32

juju status for latest/edge:

Model     Controller      Cloud/Region    Version  SLA          Timestamp
kubeflow  aks-controller  aks/westeurope  3.1.8    unsupported  09:26:07Z

App                        Version                  Status  Scale  Charm                    Channel       Rev  Address       Exposed  Message
admission-webhook                                   active      1  admission-webhook        latest/edge   308  10.0.16.94    no       
argo-controller                                     active      1  argo-controller          latest/edge   468  10.0.100.236  no       
dex-auth                                            active      1  dex-auth                 latest/edge   458  10.0.254.87   no       
envoy                                               active      1  envoy                    latest/edge   183  10.0.245.125  no       
istio-ingressgateway                                active      1  istio-gateway            latest/edge   900  10.0.44.117   no       
istio-pilot                                         active      1  istio-pilot              latest/edge   872  10.0.21.240   no       
jupyter-controller                                  active      1  jupyter-controller       latest/edge   936  10.0.131.139  no       
jupyter-ui                                          active      1  jupyter-ui               latest/edge   856  10.0.90.58    no       
katib-controller                                    active      1  katib-controller         latest/edge   526  10.0.152.253  no       
katib-db                   8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/edge      138  10.0.152.2    no       
katib-db-manager                                    active      1  katib-db-manager         latest/edge   490  10.0.236.4    no       
katib-ui                                            active      1  katib-ui                 latest/edge   501  10.0.92.22    no       
kfp-api                                             active      1  kfp-api                  latest/edge  1244  10.0.176.102  no       
kfp-db                     8.0.36-0ubuntu0.22.04.1  active      1  mysql-k8s                8.0/edge      138  10.0.3.211    no       
kfp-metadata-writer                                 active      1  kfp-metadata-writer      latest/edge   298  10.0.201.207  no       
kfp-persistence                                     active      1  kfp-persistence          latest/edge  1251  10.0.6.212    no       
kfp-profile-controller                              active      1  kfp-profile-controller   latest/edge  1209  10.0.253.135  no       
kfp-schedwf                                         active      1  kfp-schedwf              latest/edge  1263  10.0.6.119    no       
kfp-ui                                              active      1  kfp-ui                   latest/edge  1246  10.0.221.196  no       
kfp-viewer                                          active      1  kfp-viewer               latest/edge  1276  10.0.137.58   no       
kfp-viz                                             active      1  kfp-viz                  latest/edge  1197  10.0.127.237  no       
knative-eventing                                    active      1  knative-eventing         latest/edge   393  10.0.110.159  no       
knative-operator                                    active      1  knative-operator         latest/edge   368  10.0.145.205  no
knative-serving                                     active      1  knative-serving          latest/edge   394  10.0.147.68   no       
kserve-controller                                   active      1  kserve-controller        latest/edge   538  10.0.154.87   no
kubeflow-dashboard                                  active      1  kubeflow-dashboard       latest/edge   517  10.0.52.98    no       
kubeflow-profiles                                   active      1  kubeflow-profiles        latest/edge   379  10.0.164.223  no
kubeflow-roles                                      active      1  kubeflow-roles           latest/edge   207  10.0.205.101  no       
kubeflow-volumes                                    active      1  kubeflow-volumes         latest/edge   279  10.0.83.113   no       
metacontroller-operator                             active      1  metacontroller-operator  latest/edge   280  10.0.153.8    no       
minio                      res:oci-image@1755999    active      1  minio                    latest/edge   306  10.0.52.197   no
mlmd                                                active      1  mlmd                     latest/edge   174  10.0.188.218  no
oidc-gatekeeper                                     active      1  oidc-gatekeeper          latest/edge   371  10.0.125.250  no       
pvcviewer-operator                                  active      1  pvcviewer-operator       latest/edge    74  10.0.97.108   no       
seldon-controller-manager                           active      1  seldon-core              latest/edge   691  10.0.87.195   no
tensorboard-controller                              active      1  tensorboard-controller   latest/edge   281  10.0.30.201   no
tensorboards-web-app                                active      1  tensorboards-web-app     latest/edge   269  10.0.24.183   no       
training-operator                                   active      1  training-operator        latest/edge   378  10.0.16.237   no       

Unit                          Workload  Agent  Address      Ports          Message
admission-webhook/0*          active    idle   10.244.0.7                  
argo-controller/0*            active    idle   10.244.1.9                  
dex-auth/0*                   active    idle   10.244.0.8                  
envoy/0*                      active    idle   10.244.1.11                 
istio-ingressgateway/0*       active    idle   10.244.1.10                 
istio-pilot/0*                active    idle   10.244.0.9                  
jupyter-controller/0*         active    idle   10.244.1.12                 
jupyter-ui/0*                 active    idle   10.244.1.14                 
katib-controller/0*           active    idle   10.244.1.15                 
katib-db-manager/0*           active    idle   10.244.1.16                 
katib-db/0*                   active    idle   10.244.0.12                 Primary
katib-ui/0*                   active    idle   10.244.1.17                 
kfp-api/0*                    active    idle   10.244.1.18                 
kfp-db/0*                     active    idle   10.244.1.19                 Primary
kfp-metadata-writer/0*        active    idle   10.244.0.13                 
kfp-persistence/0*            active    idle   10.244.0.15                 
kfp-profile-controller/0*     active    idle   10.244.0.16                 
kfp-schedwf/0*                active    idle   10.244.0.18                 
kfp-ui/0*                     active    idle   10.244.1.20                 
kfp-viewer/0*                 active    idle   10.244.0.19                 
kfp-viz/0*                    active    idle   10.244.0.20                 
knative-eventing/0*           active    idle   10.244.0.14                 
knative-operator/0*           active    idle   10.244.0.22                 
knative-serving/0*            active    idle   10.244.0.17                 
kserve-controller/0*          active    idle   10.244.1.25                 
kubeflow-dashboard/0*         active    idle   10.244.1.23                 
kubeflow-profiles/0*          active    idle   10.244.0.24                 
kubeflow-roles/0*             active    idle   10.244.1.21
kubeflow-volumes/0*           active    idle   10.244.0.21                 
metacontroller-operator/0*    active    idle   10.244.1.22
minio/0*                      active    idle   10.244.1.24  9000-9001/TCP  
mlmd/0*                       active    idle   10.244.1.28                 
oidc-gatekeeper/0*            active    idle   10.244.0.23
pvcviewer-operator/0*         active    idle   10.244.0.26                 
seldon-controller-manager/0*  active    idle   10.244.1.26
tensorboard-controller/0*     active    idle   10.244.1.27                 
tensorboards-web-app/0*       active    idle   10.244.0.25
training-operator/0*          active    idle   10.244.1.29

Relevant Log Output

test_notebooks.py::test_notebook[training-integration] 
-------------------------------- live log call ---------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
ERROR    test_notebooks:test_notebooks.py:58 Cell In[4], line 8, in assert_job_succeeded(client, job_name, job_kind)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=30),
      3     stop=stop_after_attempt(50),
      4     reraise=True,
      5 )
      6 def assert_job_succeeded(client, job_name, job_kind):
      7     """Wait for the Job to complete successfully."""
----> 8     assert client.is_job_succeeded(
      9         name=job_name, job_kind=job_kind
     10     ), f"Job ***job_name*** was not successful."
AssertionError: Job pytorch-dist-mnist-gloo was not successful.
FAILED                                                                   [100%]

=================================== FAILURES ===================================
_______________________ test_notebook[katib-integration] _______________________

test_notebook = '/tests/.worktrees/4ca5f8e7474193b125daecbd2dc157f3fe1ab017/tests/notebooks/katib/katib-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))

        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)

        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"

        try:
            log.info(f"Running ***os.path.basename(test_notebook)***...")
            output_notebook, _ = ep.preprocess(notebook, ***"metadata": ***"path": "./"***)
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
            pytest.fail(f"Notebook execution failed with ***e.ename***: ***e.evalue***")

        for cell in output_notebook.cells:
            metadata = cell.get("metadata", dict)
            if "raises-exception" in metadata.get("tags", []):
                for cell_output in cell.outputs:
                    if cell_output.output_type == "error":
                        # extract the error message from the cell output
                        log.error(format_error_message(cell_output.traceback))
>                       pytest.fail(cell_output.traceback[-1])
E                       Failed: AssertionError: Katib Experiment was not successful.
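
Since the UAT wrapper only surfaces the final assertion, the underlying job has to be inspected directly when reproducing; a sketch (the job, pod, and namespace names are the ones that show up in the local run further down this thread):

kubectl get pytorchjob pytorch-dist-mnist-gloo -n test-kubeflow -o yaml
kubectl describe pod pytorch-dist-mnist-gloo-master-0 -n test-kubeflow
kubectl logs -n test-kubeflow pytorch-dist-mnist-gloo-master-0
kubectl logs -n test-kubeflow pytorch-dist-mnist-gloo-worker-0 -c init-pytorch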

Additional Context

No response

syncronize-issues-to-jira[bot] commented 4 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5650.

This message was autogenerated

orfeas-k commented 4 months ago

Did some exploration here; these are the logs from running the same thing locally.

Describe worker pod

╰─$ kdp -n test-kubeflow pytorch-dist-mnist-gloo-worker-0
...
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  4m44s                 default-scheduler  Successfully assigned test-kubeflow/pytorch-dist-mnist-gloo-worker-0 to aks-nodepool1-16255669-vmss000000
  Normal   Pulling    4m43s                 kubelet            Pulling image "alpine:3.10"
  Normal   Pulled     4m41s                 kubelet            Successfully pulled image "alpine:3.10" in 2.24s (2.24s including waiting)
  Normal   Created    4m41s                 kubelet            Created container init-pytorch
  Normal   Started    4m41s                 kubelet            Started container init-pytorch
  Normal   Pulling    3m38s                 kubelet            Pulling image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0"
  Normal   Pulled     2m55s                 kubelet            Successfully pulled image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0" in 42.742s (42.742s including waiting)
  Normal   Created    2m1s (x4 over 2m55s)  kubelet            Created container pytorch
  Normal   Started    2m1s (x4 over 2m55s)  kubelet            Started container pytorch
  Warning  BackOff    79s (x7 over 2m39s)   kubelet            Back-off restarting failed container pytorch in pod pytorch-dist-mnist-gloo-worker-0_test-kubeflow(ff109a33-1fa8-47c4-be31-3aff036c64b9)
  Normal   Pulled     68s (x4 over 2m41s)   kubelet            Container image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0" already present on machine

Describe master pod

╰─$ kdp -n test-kubeflow pytorch-dist-mnist-gloo-master-0
...
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m58s                  default-scheduler  Successfully assigned test-kubeflow/pytorch-dist-mnist-gloo-master-0 to aks-nodepool1-16255669-vmss000001
  Normal   Pulling    6m57s                  kubelet            Pulling image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0"
  Normal   Pulled     6m14s                  kubelet            Successfully pulled image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0" in 43.869s (43.869s including waiting)
  Normal   Created    3m20s (x5 over 6m14s)  kubelet            Created container pytorch
  Normal   Pulled     3m20s (x4 over 5m7s)   kubelet            Container image "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0" already present on machine
  Normal   Started    3m19s (x5 over 6m13s)  kubelet            Started container pytorch
  Warning  BackOff    84s (x10 over 4m53s)   kubelet            Back-off restarting failed container pytorch in pod pytorch-dist-mnist-gloo-master-0_test-kubeflow(6ca00177-a5f1-4db2-83f1-344176c40481)

Logs from worker pod

╰─$ kl -n test-kubeflow pytorch-dist-mnist-gloo-worker-0 
Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Traceback (most recent call last):
  File "/var/mnist.py", line 150, in <module>
    main()
  File "/var/mnist.py", line 123, in main
    transforms.Normalize((0.1307,), (0.3081,))
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 46, in __init__
    epoch, batch_idx * len(data), len(train_loader.dataset),
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 114, in download
    if should_distribute():
  File "/opt/conda/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

At a different point in time:
╰─$ kl -n test-kubeflow pytorch-dist-mnist-gloo-worker-0                 
Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
Using distributed PyTorch with gloo backend
Traceback (most recent call last):
  File "/var/mnist.py", line 150, in <module>
    main()
  File "/var/mnist.py", line 116, in main
    dist.init_process_group(backend=args.backend)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, start_daemon)
ValueError: host not found: Name or service not known

# from init container
# should be totally irrelevant since it succeeds in the end
╰─$ kl -n test-kubeflow pytorch-dist-mnist-gloo-worker-0 -c init-pytorch
nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'pytorch-dist-mnist-gloo-master-0': Name does not resolve
waiting for master
nslookup: can't resolve '(null)': Name does not resolve
...
nslookup: can't resolve 'pytorch-dist-mnist-gloo-master-0': Name does not resolve
waiting for master
nslookup: can't resolve '(null)': Name does not resolve

Name:      pytorch-dist-mnist-gloo-master-0
Address 1: 10.244.1.54 10-244-1-54.pytorch-dist-mnist-gloo-master-0.test-kubeflow.svc.cluster.local

Logs from master pod. I think the same message is just propagated from the worker.

╰─$ kl -n test-kubeflow pytorch-dist-mnist-gloo-master-0                 
Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Traceback (most recent call last):
  File "/var/mnist.py", line 150, in <module>
    main()
  File "/var/mnist.py", line 123, in main
    transforms.Normalize((0.1307,), (0.3081,))
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 46, in __init__
    epoch, batch_idx * len(data), len(train_loader.dataset),
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 114, in download
    if should_distribute():
  File "/opt/conda/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

kimwnasptd commented 3 months ago

@orfeas-k it looks like the example (from upstream?) is not working. The 403 is because the file it tries to download returns 403:

http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
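
For reference, the 403 can be confirmed from any machine by requesting the file directly:

curl -I http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz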

So we'll need to use a different example. Something similar had happened with Katib in the past: https://github.com/canonical/charmed-kubeflow-uats/issues/64

orfeas-k commented 3 months ago

Looks like the issue comes from using an out-of-date image. Upstream faced a similar problem (https://github.com/kubeflow/training-operator/pull/2083) and updated the image:

In the current PyTorch (gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0) and Horovod (horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu) images, it seems that we can not download the model because both images are too older. So, I replaced those images with the horovod/horovod:0.28.1 and the kubeflow/pytorch-dist-mnist:latest.
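
To double-check that the newer image fixes the example before changing the UAT, the same job can be submitted by hand with the image swapped. A minimal sketch, assuming the structure of the upstream pytorch-dist-mnist example; the job name and namespace match the ones the UAT uses above, but the manifest itself is illustrative rather than the notebook's exact spec:

kubectl apply -n test-kubeflow -f - <<EOF
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-mnist-gloo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              # image referenced in the upstream PR, instead of the old gcr.io/kubeflow-ci one
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "gloo"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-dist-mnist:latest
              args: ["--backend", "gloo"]
EOF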