canonical / katib-operators

Operators for Katib which is part of Charmed Kubeflow.
Apache License 2.0

`katib-integration` UAT is failing for 1.9/beta #211

Closed DnPlas closed 1 week ago

DnPlas commented 2 weeks ago

Bug Description

Running the UATs on 1.9/beta fails for Katib.

To Reproduce

  1. `juju deploy kubeflow --channel 1.9/beta --trust`
  2. Run `tox -ve kubeflow-local` following these steps
  3. Observe the result

Environment

Relevant Log Output

=================================== FAILURES ===================================
_______________________ test_notebook[katib-integration] _______________________

test_notebook = '/tests/notebooks/katib/katib-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))

        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)

        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"

        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
            output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
            pytest.fail(f"Notebook execution failed with {e.ename}: {e.evalue}")

        for cell in output_notebook.cells:
            metadata = cell.get("metadata", dict)
            if "raises-exception" in metadata.get("tags", []):
                for cell_output in cell.outputs:
                    if cell_output.output_type == "error":
                        # extract the error message from the cell output
                        log.error(format_error_message(cell_output.traceback))
>                       pytest.fail(cell_output.traceback[-1])
E                       Failed: AssertionError: Katib Experiment was not successful.

/tests/test_notebooks.py:59: Failed

Additional Context

No response
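For triage, the pytest wrapper shown in the log above can be approximated standalone. A minimal sketch (the notebook path comes from the failure log; the `install_python_requirements` hook and output saving are omitted):

```python
# Minimal standalone runner approximating the test wrapper above (a sketch;
# requirements installation and output persistence are left out).
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

notebook_path = "/tests/notebooks/katib/katib-integration.ipynb"  # path from the log

with open(notebook_path) as nb_file:
    notebook = nbformat.read(nb_file, as_version=nbformat.NO_CONVERT)

ep = ExecutePreprocessor(timeout=-1, kernel_name="python3")
ep.skip_cells_with_tag = "pytest-skip"

# Raises CellExecutionError if an untagged cell fails; cells tagged
# "raises-exception" must be inspected in the output notebook instead,
# which is what the wrapper's post-processing loop does.
ep.preprocess(notebook, {"metadata": {"path": "./"}})
```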

syncronize-issues-to-jira[bot] commented 2 weeks ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5975.

This message was autogenerated

orfeas-k commented 1 week ago

Running the UATs in a 1.9/edge (aka latest/edge) deployment:

Environment

MicroK8s v1.29.5 revision 6884
Juju 3.4.4-genericlinux-amd64

Driver

The error I see is the following:

wip

while the experiments have succeeded. See the logs from the experiment:

╰─$ k logs -n test-kubeflow cmaes-example-cmaes-5bd986458-9wxsw -f                         
I0716 12:04:16.195622       1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0716 12:04:35.963904       1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr"  value:"0.04188612100654" name:"momentum"  value:"0.7043612817216396"]
I0716 12:04:35.964253       1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr"  value:"0.04511033252270099" name:"momentum"  value:"0.6980954001565728"]
I0716 12:07:38.383010       1 service.go:117] Update trial mapping : trialName=cmaes-example-52zt44qf -> trialID=0
I0716 12:07:38.383040       1 service.go:117] Update trial mapping : trialName=cmaes-example-mckhkcmb -> trialID=1
I0716 12:07:38.383048       1 service.go:147] Detect changes of Trial (trialName=cmaes-example-mckhkcmb, trialID=1) : State Complete, Evaluation 0.269100
I0716 12:07:38.383154       1 service.go:84] Success to sample new trial: trialID=2, assignments=[name:"lr"  value:"0.02556132716757138" name:"momentum"  value:"0.701003503816815"]

From a notebook (UI)

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:336, in BaseRetrying.wraps.<locals>.wrapped_f(*args, **kw)
    334 copy = self.copy()
    335 wrapped_f.statistics = copy.statistics  # type: ignore[attr-defined]
--> 336 return copy(f, *args, **kw)

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:475, in Retrying.__call__(self, fn, *args, **kwargs)
    473 retry_state = RetryCallState(retry_object=self, fn=fn, args=args, kwargs=kwargs)
    474 while True:
--> 475     do = self.iter(retry_state=retry_state)
    476     if isinstance(do, DoAttempt):
    477         try:

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:376, in BaseRetrying.iter(self, retry_state)
    374 result = None
    375 for action in self.iter_state.actions:
--> 376     result = action(retry_state)
    377 return result

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:418, in BaseRetrying._post_stop_check_actions.<locals>.exc_check(rs)
    416 retry_exc = self.retry_error_cls(fut)
    417 if self.reraise:
--> 418     raise retry_exc.reraise()
    419 raise retry_exc from fut.exception()

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:185, in RetryError.reraise(self)
    183 def reraise(self) -> t.NoReturn:
    184     if self.last_attempt.failed:
--> 185         raise self.last_attempt.result()
    186     raise self

File /opt/conda/lib/python3.11/concurrent/futures/_base.py:449, in Future.result(self, timeout)
    447     raise CancelledError()
    448 elif self._state == FINISHED:
--> 449     return self.__get_result()
    451 self._condition.wait(timeout)
    453 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File /opt/conda/lib/python3.11/concurrent/futures/_base.py:401, in Future.__get_result(self)
    399 if self._exception:
    400     try:
--> 401         raise self._exception
    402     finally:
    403         # Break a reference cycle with the exception in self._exception
    404         self = None

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:478, in Retrying.__call__(self, fn, *args, **kwargs)
    476 if isinstance(do, DoAttempt):
    477     try:
--> 478         result = fn(*args, **kwargs)
    479     except BaseException:  # noqa: B902
    480         retry_state.set_exception(sys.exc_info())  # type: ignore[arg-type]

Cell In[9], line 8, in assert_experiment_succeeded(client, experiment)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=10),
      3     stop=stop_after_attempt(30),
      4     reraise=True,
      5 )
      6 def assert_experiment_succeeded(client, experiment):
      7     """Wait for the Katib Experiment to complete successfully."""
----> 8     assert client.is_experiment_succeeded(name=experiment), f"Katib Experiment was not successful."

AssertionError: Katib Experiment was not successful.

while the experiment hasn't failed:

╰─$ k logs -n admin cmaes-example-cmaes-5bd986458-vlmb4 -f
I0716 12:26:28.166999       1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0716 12:26:48.748339       1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr"  value:"0.04188612100654" name:"momentum"  value:"0.7043612817216396"]
I0716 12:26:48.748436       1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr"  value:"0.04511033252270099" name:"momentum"  value:"0.6980954001565728"]
I0716 12:29:58.070186       1 service.go:117] Update trial mapping : trialName=cmaes-example-z7pnvrwl -> trialID=1
I0716 12:29:58.070571       1 service.go:147] Detect changes of Trial (trialName=cmaes-example-z7pnvrwl, trialID=1) : State Complete, Evaluation 0.269100
I0716 12:29:58.070807       1 service.go:117] Update trial mapping : trialName=cmaes-example-vqrxjvdp -> trialID=0
I0716 12:29:58.070995       1 service.go:84] Success to sample new trial: trialID=2, assignments=[name:"lr"  value:"0.02556132716757138" name:"momentum"  value:"0.701003503816815"]


Looking at the [output experiment](https://pastebin.canonical.com/p/BkQHHHMv55/) in the notebook's prints, we see that one trial is still running and that there are no failed trials. Increasing the timeout and rerunning from a new notebook (so the same data still has to be downloaded), the UAT succeeds.
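For context on the timeout: the assertion cell in the traceback polls with `wait_exponential(multiplier=2, min=1, max=10)` and `stop_after_attempt(30)`, so the wait between polls caps at 10 s and the cell gives up after roughly 30 × 10 s ≈ 5 minutes of polling. A minimal sketch of extending that budget (the attempt count below is illustrative, not the merged change):

```python
# Sketch: extending the polling budget of the notebook's assertion cell.
# stop_after_attempt(90) is an illustrative value, not the merged fix.
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(
    wait=wait_exponential(multiplier=2, min=1, max=10),
    stop=stop_after_attempt(90),  # ~15 min of polling instead of ~5 min
    reraise=True,
)
def assert_experiment_succeeded(client, experiment):
    """Wait for the Katib Experiment to complete successfully."""
    assert client.is_experiment_succeeded(name=experiment), "Katib Experiment was not successful."
```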
orfeas-k commented 1 week ago

Solution

Since this was a timeout issue, I'm sending a PR that increases the batch size to a large enough number that the experiment performs less training and completes earlier. The purpose of the UATs is to confirm that the workloads are working, rather than to perform complete tasks with them.
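For illustration, a sketch of the relevant part of the Experiment's trial template (the image, script path, and batch-size value are assumptions based on the upstream Katib cmaes example, not the exact PR diff):

```python
# Sketch of the trial template's training command (assumed values, based on
# the upstream Katib cmaes example; not the exact PR diff).
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "training-container",
                        "image": "docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0",
                        "command": [
                            "python3",
                            "/opt/pytorch-mnist/mnist.py",
                            "--epochs=1",
                            # A larger batch size means fewer optimisation steps
                            # per epoch, so each trial finishes well inside the
                            # UAT's polling budget while still exercising the
                            # whole suggestion/trial/metrics pipeline.
                            "--batch-size=16000",
                            "--lr=${trialParameters.learningRate}",
                            "--momentum=${trialParameters.momentum}",
                        ],
                    }
                ],
                "restartPolicy": "Never",
            }
        }
    },
}
```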