facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License

[Bug] Optuna sweeper raises uncaught exception when run ends with nan value #2237

Open jbaczek opened 2 years ago

jbaczek commented 2 years ago

πŸ› Bug

Description

I am running a hyperparameter search for a deep learning model, and in some runs the model diverges and starts producing NaNs in its output. My code exits gracefully, returning a tuple of NaNs as metrics. This is expected behavior, but the Optuna sweeper does not treat it that way.

  1. The problem starts here: https://github.com/facebookresearch/hydra/blob/main/plugins/hydra_optuna_sweeper/hydra_plugins/hydra_optuna_sweeper/_impl.py#L377 . At this point, values is a tuple of NaNs and state is COMPLETE (because the experiment code exited gracefully).
  2. study.tell raises an error here: https://github.com/optuna/optuna/blob/release-v2.10.0/optuna/study/study.py#L652 because of the check performed here: https://github.com/optuna/optuna/blob/release-v2.10.0/optuna/study/_optimize.py#L319
  3. The error is finally re-raised here: https://github.com/facebookresearch/hydra/blob/main/plugins/hydra_optuna_sweeper/hydra_plugins/hydra_optuna_sweeper/_impl.py#L391 , which ultimately crashes Hydra.
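
Given the sequence above, one possible application-side workaround (not a fix for the sweeper itself) is to avoid returning NaN at all and report a large finite penalty instead. The sketch below is only an illustration: the helper name `sanitize_metrics` and the `penalty` default are hypothetical, and it assumes every objective is minimized, so a large value marks the run as bad.

```python
import math
from typing import Sequence, Tuple


def sanitize_metrics(values: Sequence[float], penalty: float = 1e9) -> Tuple[float, ...]:
    """Replace NaN metrics with a finite penalty (hypothetical helper).

    Assumes all objectives are minimized; for maximized objectives the
    penalty would need the opposite sign.
    """
    return tuple(penalty if math.isnan(float(v)) else float(v) for v in values)


# Example: a diverged run that would otherwise return (nan, 0.5)
metrics = sanitize_metrics((float("nan"), 0.5))
```

The trade-off is that diverged runs then count as completed trials with a poor score rather than failed trials, which may influence the sampler differently.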

Checklist

To reproduce

Minimal Code/Config snippet to reproduce

Stack trace/error message

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra_plugins/hydra_optuna_sweeper/_impl.py", line 237, in sweep
    study.tell(trial=trial, state=state, values=values)
  File "/opt/conda/lib/python3.8/site-packages/optuna/study/study.py", line 652, in tell
    raise ValueError(values_conversion_failure_message)
ValueError: Trial 0 failed, because the objective function returned nan.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 386, in <lambda>
    lambda: hydra.multirun(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 140, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
  File "/opt/conda/lib/python3.8/site-packages/hydra_plugins/hydra_optuna_sweeper/optuna_sweeper.py", line 42, in sweep
    return self.sweeper.sweep(arguments)
  File "/opt/conda/lib/python3.8/site-packages/hydra_plugins/hydra_optuna_sweeper/_impl.py", line 240, in sweep
    study.tell(trial=trial, state=state, values=values)
  File "/opt/conda/lib/python3.8/site-packages/optuna/study/study.py", line 592, in tell
    raise ValueError(
ValueError: Values were told. Values cannot be specified when state is TrialState.PRUNED or TrialState.FAIL.

Expected Behavior

If the code returns NaN, the sweeper should mark the trial as failed and proceed without crashing.
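
A minimal sketch of the kind of check that could produce this behavior, written without importing Optuna so it stands alone. The function name `resolve_trial_outcome` and the string state labels are hypothetical; in the sweeper they would presumably map to `optuna.trial.TrialState.FAIL` and `optuna.trial.TrialState.COMPLETE`. This is not the plugin's actual code.

```python
import math
from typing import Optional, Sequence, Tuple


def resolve_trial_outcome(
    values: Optional[Sequence[float]],
) -> Tuple[str, Optional[Tuple[float, ...]]]:
    """Decide what to report for a finished trial (hypothetical helper).

    Returns ("FAIL", None) when any metric is NaN, because the second
    traceback shows that values cannot be specified for a failed trial.
    Otherwise returns ("COMPLETE", values).
    """
    if values is None or any(math.isnan(float(v)) for v in values):
        return "FAIL", None
    return "COMPLETE", tuple(float(v) for v in values)


# Example: a diverged run and a healthy run
diverged = resolve_trial_outcome((float("nan"), float("nan")))
healthy = resolve_trial_outcome((0.25, 0.9))
```

With a check like this applied before `study.tell`, a NaN result would be reported as a failed trial with no values, and the sweep could continue with the next trial.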

System information

Jasha10 commented 2 years ago

Thanks for the report!