canonical / charmed-kubeflow-uats

Automated UATs for Charmed Kubeflow
Apache License 2.0

AssertionError: Katib Experiment was not successful #101

Open mvlassis opened 3 months ago

mvlassis commented 3 months ago

Bug Description

This issue was encountered in the deploy-cfk-to-eks (1.8) action in the bundle-kubeflow repository. The full logs can be found here.

The katib-integration test in test_notebooks.py fails and raises an AssertionError. This is the relevant excerpt from the logs:

-------------------------------- live log call ---------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running katib-integration.ipynb...
ERROR    test_notebooks:test_notebooks.py:58 Cell In[8], line 8, in assert_experiment_succeeded(client, experiment)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=10),
      3     stop=stop_after_attempt(30),
      4     reraise=True,
      5 )
      6 def assert_experiment_succeeded(client, experiment):
      7     """Wait for the Katib Experiment to complete successfully."""
----> 8     assert client.is_experiment_succeeded(name=experiment), f"Katib Experiment was not successful."
AssertionError: Katib Experiment was not successful.
FAILED

Because the error occurred during a GitHub Actions run, I couldn't access the deployment to investigate further.
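Since the cluster is gone once the run ends, one option is to make the assertion itself carry more information. The following is only a sketch: it assumes the Katib client exposes a `get_experiment_conditions(name=...)` call returning objects with `type`, `status`, `reason` and `message` fields, and `StubClient` is a hypothetical stand-in so the snippet runs without a cluster. `describe_conditions` is an illustrative helper, not part of the notebook.

```python
from types import SimpleNamespace


def describe_conditions(conditions):
    """Render Katib-style conditions as 'type=status (reason): message' lines."""
    lines = [
        f"{cond.type}={cond.status} ({cond.reason}): {cond.message}"
        for cond in conditions or []
    ]
    return "\n".join(lines) or "<no conditions reported>"


def assert_experiment_succeeded(client, experiment):
    """Same check as in the notebook, with the conditions added to the message."""
    if not client.is_experiment_succeeded(name=experiment):
        # Assumption: the client offers get_experiment_conditions(name=...).
        conditions = client.get_experiment_conditions(name=experiment)
        raise AssertionError(
            "Katib Experiment was not successful.\n"
            + describe_conditions(conditions)
        )


class StubClient:
    """Hypothetical stand-in for the Katib client so the sketch runs offline."""

    def is_experiment_succeeded(self, name):
        return False

    def get_experiment_conditions(self, name):
        return [
            SimpleNamespace(
                type="Running",
                status="True",
                reason="ExperimentRunning",
                message="Experiment is running",
            )
        ]
```

With something like this in the notebook cell, the CI log would show whether the Experiment was still running or had failed when the retries ran out.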

Note that this issue was not encountered during a previous run of the GitHub action, which can be found here. It's not clear whether the issue is consistently reproducible or intermittent.

To Reproduce

From the main page of the bundle-kubeflow repository, go to Actions, select the "Create EKS cluster, deploy CKF and run bundle test" action, and run it with the following options:

Environment

This job tries to deploy the UATs, using the following configuration from the dependencies.yaml file found here:

Relevant Log Output

-------------------------------- live log call ---------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running katib-integration.ipynb...
ERROR    test_notebooks:test_notebooks.py:58 Cell In[8], line 8, in assert_experiment_succeeded(client, experiment)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=10),
      3     stop=stop_after_attempt(30),
      4     reraise=True,
      5 )
      6 def assert_experiment_succeeded(client, experiment):
      7     """Wait for the Katib Experiment to complete successfully."""
----> 8     assert client.is_experiment_succeeded(name=experiment), f"Katib Experiment was not successful."
AssertionError: Katib Experiment was not successful.
FAILED

Additional Context

No response

syncronize-issues-to-jira[bot] commented 3 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6112.

This message was autogenerated

misohu commented 2 months ago

@orfeas-k can you rerun it and make sure it's gone?

misohu commented 2 months ago

May be related to https://github.com/canonical/bundle-kubeflow/issues/893 and https://github.com/canonical/bundle-kubeflow/issues/942

orfeas-k commented 2 months ago

I reran the CI here https://github.com/canonical/bundle-kubeflow/actions/runs/10388848456/job/28765995378 and it succeeds, which means we are dealing with an intermittent issue.
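One hypothesis that may be worth ruling out for an intermittent failure is the retry budget in the notebook. The sketch below estimates how long `assert_experiment_succeeded` sleeps in total with the settings from the traceback (`multiplier=2, min=1, max=10`, 30 attempts). It does not import tenacity; it assumes the wait before retry n is `multiplier * 2**(n-1)` clamped to `[min, max]`, which may differ between tenacity versions, and it ignores the time spent inside each client call. `exponential_waits` is an illustrative helper.

```python
def exponential_waits(attempts, multiplier=2, min_wait=1, max_wait=10, exp_base=2):
    """Return the sleep before each retry; nothing is slept after the last attempt."""
    waits = []
    for attempt in range(1, attempts):
        raw = multiplier * exp_base ** (attempt - 1)
        # Clamp the raw exponential value to the configured bounds.
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


waits = exponential_waits(attempts=30)
total_sleep = sum(waits)
print(waits[:5], total_sleep)  # [2, 4, 8, 10, 10] 274
```

Under these assumptions the notebook gives the Experiment roughly 274 seconds of waiting, a bit over four and a half minutes. If the Experiment sometimes takes longer than that on EKS, the assertion would fail on some runs and pass on others, but this has not been confirmed against the failing run.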