Closed DnPlas closed 1 week ago
Thank you for reporting us your feedback!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5975.
This message was autogenerated
1.9/edge (aka latest/edge) deployment
MicroK8s v1.29.5 revision 6884
Juju 3.4.4-genericlinux-amd64
The error I see is the following
wip
while the experiments have succeeded. See the logs from the experiment:
╰─$ k logs -n test-kubeflow cmaes-example-cmaes-5bd986458-9wxsw -f
I0716 12:04:16.195622 1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0716 12:04:35.963904 1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr" value:"0.04188612100654" name:"momentum" value:"0.7043612817216396"]
I0716 12:04:35.964253 1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr" value:"0.04511033252270099" name:"momentum" value:"0.6980954001565728"]
I0716 12:07:38.383010 1 service.go:117] Update trial mapping : trialName=cmaes-example-52zt44qf -> trialID=0
I0716 12:07:38.383040 1 service.go:117] Update trial mapping : trialName=cmaes-example-mckhkcmb -> trialID=1
I0716 12:07:38.383048 1 service.go:147] Detect changes of Trial (trialName=cmaes-example-mckhkcmb, trialID=1) : State Complete, Evaluation 0.269100
I0716 12:07:38.383154 1 service.go:84] Success to sample new trial: trialID=2, assignments=[name:"lr" value:"0.02556132716757138" name:"momentum" value:"0.701003503816815"]
On the `main` branch, it fails with the following error:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[10], line 6
      3 client.get_experiment(name=EXPERIMENT_NAME)
      5 # wait for the Experiment to complete successfully
----> 6 assert_experiment_succeeded(client, EXPERIMENT_NAME)

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:336, in BaseRetrying.wraps.

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:475, in Retrying.__call__(self, fn, *args, **kwargs)
    473 retry_state = RetryCallState(retry_object=self, fn=fn, args=args, kwargs=kwargs)
    474 while True:
--> 475     do = self.iter(retry_state=retry_state)
    476     if isinstance(do, DoAttempt):
    477         try:

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:376, in BaseRetrying.iter(self, retry_state)
    374 result = None
    375 for action in self.iter_state.actions:
--> 376     result = action(retry_state)
    377 return result

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:418, in BaseRetrying._post_stop_check_actions.

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:185, in RetryError.reraise(self)
    183 def reraise(self) -> t.NoReturn:
    184     if self.last_attempt.failed:
--> 185         raise self.last_attempt.result()
    186     raise self

File /opt/conda/lib/python3.11/concurrent/futures/_base.py:449, in Future.result(self, timeout)
    447     raise CancelledError()
    448 elif self._state == FINISHED:
--> 449     return self.__get_result()
    451 self._condition.wait(timeout)
    453 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File /opt/conda/lib/python3.11/concurrent/futures/_base.py:401, in Future.__get_result(self)
    399 if self._exception:
    400     try:
--> 401         raise self._exception
    402     finally:
    403         # Break a reference cycle with the exception in self._exception
    404         self = None

File /opt/conda/lib/python3.11/site-packages/tenacity/__init__.py:478, in Retrying.__call__(self, fn, *args, **kwargs)
    476 if isinstance(do, DoAttempt):
    477     try:
--> 478         result = fn(*args, **kwargs)
    479     except BaseException:  # noqa: B902
    480         retry_state.set_exception(sys.exc_info())  # type: ignore[arg-type]

Cell In[9], line 8, in assert_experiment_succeeded(client, experiment)
      1 @retry(
      2     wait=wait_exponential(multiplier=2, min=1, max=10),
      3     stop=stop_after_attempt(30),
      4     reraise=True,
      5 )
      6 def assert_experiment_succeeded(client, experiment):
      7     """Wait for the Katib Experiment to complete successfully."""
----> 8     assert client.is_experiment_succeeded(name=experiment), f"Katib Experiment was not successful."

AssertionError: Katib Experiment was not successful.
even though the experiment hasn't failed:
╰─$ k logs -n admin cmaes-example-cmaes-5bd986458-vlmb4 -f
I0716 12:26:28.166999 1 main.go:52] Start Goptuna suggestion service: 0.0.0.0:6789
I0716 12:26:48.748339 1 service.go:84] Success to sample new trial: trialID=0, assignments=[name:"lr" value:"0.04188612100654" name:"momentum" value:"0.7043612817216396"]
I0716 12:26:48.748436 1 service.go:84] Success to sample new trial: trialID=1, assignments=[name:"lr" value:"0.04511033252270099" name:"momentum" value:"0.6980954001565728"]
I0716 12:29:58.070186 1 service.go:117] Update trial mapping : trialName=cmaes-example-z7pnvrwl -> trialID=1
I0716 12:29:58.070571 1 service.go:147] Detect changes of Trial (trialName=cmaes-example-z7pnvrwl, trialID=1) : State Complete, Evaluation 0.269100
I0716 12:29:58.070807 1 service.go:117] Update trial mapping : trialName=cmaes-example-vqrxjvdp -> trialID=0
I0716 12:29:58.070995 1 service.go:84] Success to sample new trial: trialID=2, assignments=[name:"lr" value:"0.02556132716757138" name:"momentum" value:"0.701003503816815"]
Looking at the [experiment output](https://pastebin.canonical.com/p/BkQHHHMv55/) printed in the notebook, we see that one trial is still running and that no trial has failed. After increasing the timeout and rerunning from a new notebook (so the same data still has to be downloaded), the UAT appears to succeed.
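For reference, the waiting budget of the retry policy shown in the traceback (`wait_exponential(multiplier=2, min=1, max=10)` with `stop_after_attempt(30)`) can be estimated with a few lines of plain Python. This is only a sketch: it assumes tenacity computes each wait as `multiplier * 2 ** (attempt - 1)` clamped to `[min, max]`, and the helper name `wait_schedule` is made up for illustration. It also ignores the time each `is_experiment_succeeded` call takes.

```python
def wait_schedule(attempts, multiplier=2.0, min_wait=1.0, max_wait=10.0, exp_base=2.0):
    """Return the sleep durations (seconds) between consecutive attempts.

    Assumption: mirrors tenacity's wait_exponential, i.e. the wait after
    attempt n is multiplier * exp_base ** (n - 1), clamped to [min, max].
    """
    waits = []
    # There is no sleep after the final attempt, so only attempts - 1 waits.
    for attempt in range(1, attempts):
        raw = multiplier * exp_base ** (attempt - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


schedule = wait_schedule(attempts=30)
total_wait = sum(schedule)

print(schedule[:5])  # first few waits: 2, 4, 8, then capped at 10
print(total_wait)    # total seconds spent sleeping across all retries
```

Under these assumptions the notebook sleeps for about 274 seconds in total. In the logs above, the first trial only completes about 210 seconds after the suggestion service starts (12:26:28 to 12:29:58), and a third trial is sampled after that, so a budget of this size looks too tight for the whole experiment.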
Since this was a timeout issue, I'm sending a PR that increases the `batch-size` to a large enough number so the experiment doesn't perform that much training and completes earlier. The purpose of the UATs is to confirm that the workloads work, not to perform complete tasks with them.
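To illustrate why a larger batch size shortens each trial, here is a rough sketch of how the number of optimizer steps per epoch shrinks as the batch size grows. The dataset size and the batch sizes below are hypothetical values chosen for illustration; they are not taken from the UAT notebook or the PR.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to go through the dataset once."""
    return math.ceil(num_samples / batch_size)


# Hypothetical dataset size, for illustration only.
NUM_SAMPLES = 60_000

for batch_size in (64, 512, 4096):
    print(batch_size, steps_per_epoch(NUM_SAMPLES, batch_size))
```

Fewer steps per epoch usually means less wall-clock time per trial, although the actual speed-up depends on the hardware and the training script.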
Bug Description
Running the UATs on 1.9/beta fails for Katib.
To Reproduce
juju deploy kubeflow --channel 1.9/beta --trust
tox -ve kubeflow-local
following these steps
Environment
Relevant Log Output
Additional Context
No response