canonical / katib-operators

Operators for Katib which is part of Charmed Kubeflow.
Apache License 2.0

`katib-integration` UAT is failing for 1.9/beta #211

Closed DnPlas closed 1 week ago

DnPlas commented 2 weeks ago

Bug Description

Running the UATs on 1.9/beta fails for Katib.

To Reproduce

  1. `juju deploy kubeflow --channel 1.9/beta --trust`
  2. Run `tox -ve kubeflow-local` following these steps
  3. Observe the result

Environment

Relevant Log Output

=================================== FAILURES ===================================
_______________________ test_notebook[katib-integration] _______________________

test_notebook = '/tests/notebooks/katib/katib-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))

        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)

        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"

        try:
            log.info(f"Running {os.path.basename(test_notebook)}...")
            output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
            pytest.fail(f"Notebook execution failed with {e.ename}: {e.evalue}")

        for cell in output_notebook.cells:
            metadata = cell.get("metadata", dict)
            if "raises-exception" in metadata.get("tags", []):
                for cell_output in cell.outputs:
                    if cell_output.output_type == "error":
                        # extract the error message from the cell output
                        log.error(format_error_message(cell_output.traceback))
>                       pytest.fail(cell_output.traceback[-1])
E                       Failed: AssertionError: Katib Experiment was not successful.

/tests/test_notebooks.py:59: Failed

Additional Context

No response
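For triage, the pytest wrapper shown in the log above can be approximated standalone. A minimal sketch (the notebook path comes from the failure log; the `install_python_requirements` hook and output saving are omitted):

```python
# Minimal standalone runner approximating the test wrapper above (a sketch;
# requirements installation and output persistence are left out).
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

notebook_path = "/tests/notebooks/katib/katib-integration.ipynb"  # path from the log

with open(notebook_path) as nb_file:
    notebook = nbformat.read(nb_file, as_version=nbformat.NO_CONVERT)

ep = ExecutePreprocessor(timeout=-1, kernel_name="python3")
ep.skip_cells_with_tag = "pytest-skip"

# Raises CellExecutionError if an untagged cell fails; cells tagged
# "raises-exception" must be inspected in the output notebook instead,
# which is what the wrapper's post-processing loop does.
ep.preprocess(notebook, {"metadata": {"path": "./"}})
```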

syncronize-issues-to-jira[bot] commented 2 weeks ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5975.

This message was autogenerated

orfeas-k commented 1 week ago

Running the UATs in a 1.9/edge (aka latest/edge) deployment:

Environment

MicroK8s v1.29.5 revision 6884
Juju 3.4.4-genericlinux-amd64

Driver

The error I see is the following:

wip

while the experiments have succeeded. See the logs from the experiment:

╰─$ k logs -n test-kubeflow cmaes-example-cmaes-5bd986458-9wxsw -f                         
I0716 12:04:16.195622       1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0716 12:04:35.963904       1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr"  value:"0.04188612100654" name:"momentum"  value:"0.7043612817216396"]
I0716 12:04:35.964253       1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr"  value:"0.04511033252270099" name:"momentum"  value:"0.6980954001565728"]
I0716 12:07:38.383010       1 service.go:117] Update trial mapping : trialName=cmaes-example-52zt44qf -> trialID=0
I0716 12:07:38.383040       1 service.go:117] Update trial mapping : trialName=cmaes-example-mckhkcmb -> trialID=1
I0716 12:07:38.383048       1 service.go:147] Detect changes of Trial (trialName=cmaes-example-mckhkcmb, trialID=1) : State Complete, Evaluation 0.269100
I0716 12:07:38.383154       1 service.go:84] Success to sample new trial: trialID=2, assignments=[name:"lr"  value:"0.02556132716757138" name:"momentum"  value:"0.701003503816815"]

From a notebook (UI)

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:336, in BaseRetrying.wraps.<locals>.wrapped_f(*args, **kw)
    334 copy = self.copy()
    335 wrapped_f.statistics = copy.statistics  # type: ignore[attr-defined]
--> 336 return copy(f, *args, **kw)

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:475, in Retrying.__call__(self, fn, *args, **kwargs)
    473 retry_state = RetryCallState(retry_object=self, fn=fn, args=args, kwargs=kwargs)
    474 while True:
--> 475     do = self.iter(retry_state=retry_state)
    476     if isinstance(do, DoAttempt):
    477         try:

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:376, in BaseRetrying.iter(self, retry_state)
    374 result = None
    375 for action in self.iter_state.actions:
--> 376     result = action(retry_state)
    377 return result

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:418, in BaseRetrying._post_stop_check_actions.<locals>.exc_check(rs)
    416 retry_exc = self.retry_error_cls(fut)
    417 if self.reraise:
--> 418     raise retry_exc.reraise()
    419 raise retry_exc from fut.exception()

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:185, in RetryError.reraise(self)
    183 def reraise(self) -> t.NoReturn:
    184     if self.last_attempt.failed:
--> 185         raise self.last_attempt.result()
    186     raise self

File /opt/conda/lib/python3.11/concurrent/futures/_base.py:449, in Future.result(self, timeout)
    447     raise CancelledError()
    448 elif self._state == FINISHED:
--> 449     return self.__get_result()
    451 self._condition.wait(timeout)
    453 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File /opt/conda/lib/python3.11/concurrent/futures/_base.py:401, in Future.__get_result(self)
    399 if self._exception:
    400     try:
--> 401         raise self._exception
    402     finally:
    403         # Break a reference cycle with the exception in self._exception
    404         self = None

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:478, in Retrying.__call__(self, fn, *args, **kwargs)
    476 if isinstance(do, DoAttempt):
    477     try:
--> 478         result = fn(*args, **kwargs)
    479     except BaseException:  # noqa: B902
    480         retry_state.set_exception(sys.exc_info())  # type: ignore[arg-type]

Cell In[9], line 8, in assert_experiment_succeeded(client, experiment)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=10),
      3     stop=stop_after_attempt(30),
      4     reraise=True,
      5 )
      6 def assert_experiment_succeeded(client, experiment):
      7     """Wait for the Katib Experiment to complete successfully."""
----> 8     assert client.is_experiment_succeeded(name=experiment), f"Katib Experiment was not successful."

AssertionError: Katib Experiment was not successful.

while the experiment hasn't failed:

╰─$ k logs -n admin cmaes-example-cmaes-5bd986458-vlmb4 -f
I0716 12:26:28.166999       1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0716 12:26:48.748339       1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr"  value:"0.04188612100654" name:"momentum"  value:"0.7043612817216396"]
I0716 12:26:48.748436       1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr"  value:"0.04511033252270099" name:"momentum"  value:"0.6980954001565728"]
I0716 12:29:58.070186       1 service.go:117] Update trial mapping : trialName=cmaes-example-z7pnvrwl -> trialID=1
I0716 12:29:58.070571       1 service.go:147] Detect changes of Trial (trialName=cmaes-example-z7pnvrwl, trialID=1) : State Complete, Evaluation 0.269100
I0716 12:29:58.070807       1 service.go:117] Update trial mapping : trialName=cmaes-example-vqrxjvdp -> trialID=0
I0716 12:29:58.070995       1 service.go:84] Success to sample new trial: trialID=2, assignments=[name:"lr"  value:"0.02556132716757138" name:"momentum"  value:"0.701003503816815"]


Looking at the [output experiment](https://pastebin.canonical.com/p/BkQHHHMv55/) in the notebook's prints, we see that one trial is still running and that there are no failed trials. Increasing the timeout and rerunning from a new notebook (so the same data still has to be downloaded), the UAT succeeds.
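For context on the timeout: the assertion cell in the traceback polls with `wait_exponential(multiplier=2, min=1, max=10)` and `stop_after_attempt(30)`, so the wait between polls caps at 10 s and the cell gives up after roughly 30 × 10 s ≈ 5 minutes of polling. A minimal sketch of extending that budget (the attempt count below is illustrative, not the merged change):

```python
# Sketch: extending the polling budget of the notebook's assertion cell.
# stop_after_attempt(90) is an illustrative value, not the merged fix.
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(
    wait=wait_exponential(multiplier=2, min=1, max=10),
    stop=stop_after_attempt(90),  # ~15 min of polling instead of ~5 min
    reraise=True,
)
def assert_experiment_succeeded(client, experiment):
    """Wait for the Katib Experiment to complete successfully."""
    assert client.is_experiment_succeeded(name=experiment), "Katib Experiment was not successful."
```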
orfeas-k commented 1 week ago

Solution

Since this was a timeout issue, I'm sending a PR that increases the batch size to a large enough number that the experiment performs less training and completes earlier. The purpose of the UATs is to confirm that the workloads are working, rather than to perform complete tasks with them.
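For illustration, a sketch of the relevant part of the Experiment's trial template (the image, script path, and batch-size value are assumptions based on the upstream Katib cmaes example, not the exact PR diff):

```python
# Sketch of the trial template's training command (assumed values, based on
# the upstream Katib cmaes example; not the exact PR diff).
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "training-container",
                        "image": "docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.16.0",
                        "command": [
                            "python3",
                            "/opt/pytorch-mnist/mnist.py",
                            "--epochs=1",
                            # A larger batch size means fewer optimisation steps
                            # per epoch, so each trial finishes well inside the
                            # UAT's polling budget while still exercising the
                            # whole suggestion/trial/metrics pipeline.
                            "--batch-size=16000",
                            "--lr=${trialParameters.learningRate}",
                            "--momentum=${trialParameters.momentum}",
                        ],
                    }
                ],
                "restartPolicy": "Never",
            }
        }
    },
}
```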