ci(aks): Katib UAT fails on EKS with Juju 3.5.0

misohu commented 1 week ago

Bug Description

The Katib UAT fails for the latest/edge bundle in our CI with AssertionError: Katib Experiment was not successful. The failure was found while changing the Juju agent version to 3.5.0 in this PR.

Example failing run on EKS: https://github.com/canonical/bundle-kubeflow/actions/runs/9578729770. Note that the same tests pass on AKS. Here is the equivalent run on AKS: https://github.com/canonical/bundle-kubeflow/actions/runs/9578503353

To Reproduce

Rerun the CI from the branch KF-5847-agent-version-aks-eks-3-5-0, or from main, for the "latest/edge" bundle.

Environment

EKS: 1.26
Juju agent: 3.5.0
CKF: latest/edge

Relevant Log Output

test_notebook = '/tests/.worktrees/ad0922d6911a11f480ed4edfb7b8c5d7ad9c1e7f/tests/notebooks/katib/katib-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))

        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)

        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"

        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
            output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
            pytest.fail(f"Notebook execution failed with {e.ename}: {e.evalue}")

        for cell in output_notebook.cells:
            metadata = cell.get("metadata", dict)
            if "raises-exception" in metadata.get("tags", []):
                for cell_output in cell.outputs:
                    if cell_output.output_type == "error":
                        # extract the error message from the cell output
                        log.error(format_error_message(cell_output.traceback))
>                       pytest.fail(cell_output.traceback[-1])
E                       Failed: AssertionError: Katib Experiment was not successful.

/tests/.worktrees/ad0922d6911a11f480ed4edfb7b8c5d7ad9c1e7f/tests/test_notebooks.py:59: Failed
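
The test wrapper above saves the executed notebook back to the original file for debugging, so the failing cell's full traceback should be recoverable from that file. Below is a minimal sketch for pulling the error outputs out of a saved notebook. It assumes the standard nbformat v4 JSON layout (a top-level "cells" list, with "outputs" entries whose "output_type" is "error"). The function names collect_cell_errors and load_notebook are hypothetical helpers, not part of the test suite.

```python
import json
from typing import Any


def collect_cell_errors(notebook: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one record per error output found in the notebook's code cells.

    Assumes the nbformat v4 JSON layout; missing keys are treated as empty.
    """
    errors = []
    for index, cell in enumerate(notebook.get("cells", [])):
        # Only code cells carry outputs.
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                errors.append(
                    {
                        "cell": index,
                        "ename": output.get("ename", ""),
                        "evalue": output.get("evalue", ""),
                        "traceback": output.get("traceback", []),
                    }
                )
    return errors


def load_notebook(path: str) -> dict[str, Any]:
    """Load a saved .ipynb file as a plain dictionary (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, running collect_cell_errors(load_notebook("katib-integration.ipynb")) against the notebook saved by the CI run would list each failing cell with its exception name and message. This only shows what the notebook itself reported. The underlying Katib Experiment status would still need to be checked on the cluster.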