`training-integration` notebook fails on self-hosted runners

orfeas-k commented 11 months ago

Bug Description

Running the training-integration UAT notebook on a self-hosted runner fails with AssertionError: Job mnist was not successful.

INFO     test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
ERROR    test_notebooks:test_notebooks.py:58 Cell In[4], line 8, in assert_job_succeeded(client, job_name, job_kind)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=30),
      3     stop=stop_after_attempt(50),
      4     reraise=True,
      5 )
      6 def assert_job_succeeded(client, job_name, job_kind):
      7     """Wait for the Job to complete successfully."""
----> 8     assert client.is_job_succeeded(
      9         name=job_name, job_kind=job_kind
     10     ), f"Job {job_name} was not successful."
AssertionError: Job mnist was not successful.

Note that the same UAT succeeds on an EC2 instance

To Reproduce

Run the Deploy CKF bundle and run UATs workflow with tests-bundle/1.8 as bundle tests, --file=releases/latest/edge/bundle.yaml as bundle source, 1.25-strict/stable as microk8s-version, 3.1/stable as juju-version and feature-orfeas-lightkube-trustenv as the uats branch.

Environment

Self-hosted runner
Microk8s 1.25-strict/stable
Juju 3.1/stable
CKF latest/edge

Relevant Log Output

```shell INFO test_notebooks:test_notebooks.py:44 Running minio-integration.ipynb... _____________________ test_notebook[training-integration] ______________________ test_notebook = '/tests/notebooks/training/training-integration.ipynb' @pytest.mark.ipynb @pytest.mark.parametrize( # notebook - ipynb file to execute "test_notebook", NOTEBOOKS.values(), ids=NOTEBOOKS.keys(), ) def test_notebook(test_notebook): """Test Notebook Generic Wrapper.""" os.chdir(os.path.dirname(test_notebook)) with open(test_notebook) as nb: notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT) ep = ExecutePreprocessor( timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements ) ep.skip_cells_with_tag = "pytest-skip" try: log.info(f"Running {os.path.basename(test_notebook)}...") output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}}) # persist the notebook output to the original file for debugging purposes save_notebook(output_notebook, test_notebook) except CellExecutionError as e: # handle underlying error pytest.fail(f"Notebook execution failed with {e.ename}: {e.evalue}") for cell in output_notebook.cells: metadata = cell.get("metadata", dict) if "raises-exception" in metadata.get("tags", []): for cell_output in cell.outputs: if cell_output.output_type == "error": # extract the error message from the cell output log.error(format_error_message(cell_output.traceback)) > pytest.fail(cell_output.traceback[-1]) E Failed: AssertionError: Job mnist was not successful. /tests/test_notebooks.py:59: Failed ... ------------------------------ Captured log call ------------------------------- INFO test_notebooks:test_notebooks.py:44 Running training-integration.ipynb... ERROR test_notebooks:test_notebooks.py:58 Cell In[4], line 8, in assert_job_succeeded(client, job_name, job_kind) 1 @retry( 2 wait=wait_exponential(multiplier=2, min=1, max=30), 3 stop=stop_after_attempt(50), 4 reraise=True, 5 ) 6 def assert_job_succeeded(client, job_name, job_kind): 7 """Wait for the Job to complete successfully.""" ----> 8 assert client.is_job_succeeded( 9 name=job_name, job_kind=job_kind 10 ), f"Job {job_name} was not successful." AssertionError: Job mnist was not successful. ```

Additional Context

No response

orfeas-k commented 11 months ago

Same results on latest/edge

orfeas-k commented 11 months ago

Debugging

Aproxy logs

Going through the aproxy logs and the artifacts' ketall logs, I think that they are irrelevant to the failing jobs in training-integration since

2023-11-17T15:56:37Z aproxy.aproxy[8778]: 2023/11/17 15:56:37 ERROR failed to send HTTP response to connection src=10.1.56.201:39184 original_dst=169.254.169.254:80 host=169.254.169.254:80 error="write tcp 10.8.232.2:8443->10.1.56.201:39184: write: broken pipe"
2023-11-17T15:56:37Z aproxy.aproxy[8778]: 2023/11/17 15:56:37 ERROR failed to send HTTP response to connection src=10.1.56.201:39172 original_dst=169.254.169.254:80 host=169.254.169.254:80 error="write tcp 10.8.232.2:8443->10.1.56.201:39172: write: broken pipe"
2023-11-17T15:56:39Z aproxy.aproxy[8778]: 2023/11/17 15:56:39 INFO relay connection to http proxy src=10.1.56.201:54444 original_dst=151.101.64.223:443 host=pypi.org:443
2023-11-17T15:56:39Z aproxy.aproxy[8778]: 2023/11/17 15:56:39 INFO relay connection to http proxy src=10.1.56.201:53514 original_dst=199.232.53.55:443 host=files.pythonhosted.org:443
2023-11-17T15:56:42Z aproxy.aproxy[8778]: 2023/11/17 15:56:42 ERROR failed to preread host from connection src=10.1.56.199:56028 original_dst=169.254.169.254:80 error="failed to preread HTTP request: EOF"
I see 10.8.232.2 being the HostIP for a container of many pods,

10.1.56.201 being the PodIP in a container of pod test-kubeflow-sth which also has an annotation cni.projectcalico.org/podIP: 10.1.56.201/32.

10.1.56.199 is the PodIP of the initcontainer of the following pod

name: ml-pipeline-visualizationserver-7b5889796d-zx4kr
namespace: test-kubeflow

Thus I dont think they relate somehow to the reason the tfjob fails.

Image code

Looking at the image's code, I don't see sth specific that could require internet connection. My best guess at the moment is that a pod fails to talk to another pod spun by the job.

Removing the test's teardown

Removing the test's teardown so we can see the state of the cluster when the job failed, I don't see any failed pods. Only those that are in a Running state.

test-kubeflow                         pod/pytorch-dist-mnist-gloo-master-0                   1/1     Running   0               5m9s
test-kubeflow                         pod/pytorch-dist-mnist-gloo-worker-0                   1/1     Running   0               5m7s

orfeas-k commented 11 months ago

Reran the test after adding the following last line in Jobs' container as @nishant-dash did in order to get those working behind a proxy.

container = V1Container(
    name=<name>,
    image=<image>,
    args=<args-list>,
    env=[V1EnvVar(name="https_proxy", value="http://squid.internal:3128"), V1EnvVar(name="http_proxy", value="http://squid.internal:3128"), V1EnvVar(name="HTTPS_PROXY", value="http://squid.internal:3128"), V1EnvVar(name="HTTP_PROXY", value="http://squid.internal:3128")],
)

but they failed with a different error

tenacity.RetryError: RetryError[<Future at 0x7f533e25a5e0 state=finished returned bool>]

It's also notable that the job took 20 minutes more to complete this time which could mean that it failed on a later stage (another job maybe). At this point, we should schedule a call witih IS in order to SSH into the instance and find the cause of the above failure.

syncronize-issues-to-jira[bot] commented 5 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5734.

This message was autogenerated

misohu commented 1 month ago

Currently we are not planing to work with self hosted runners.

canonical / charmed-kubeflow-uats