automl / neps

Neural Pipeline Search (NePS): Helps deep learning experts find the best neural pipeline.
https://automl.github.io/neps/
Apache License 2.0

[UX] Show error and traceback when something goes wrong #62

Closed eddiebergman closed 3 months ago

eddiebergman commented 7 months ago

While testing some new things, I got this error, and it contains no information useful for understanding what went wrong.

/home/skantify/code/neps/neps_examples/efficiency/multi_fidelity.py:81: in <module>
    neps.run(
/home/skantify/code/neps/neps/api.py:273: in run
    metahyper_run(
/home/skantify/code/neps/neps/metahyper/api.py:586: in metahyper_run
    post_evaluation_hook(
/home/skantify/code/neps/neps/api.py:39: in _post_evaluation_hook
    loss = get_loss(result, loss_value_on_error, ignore_errors)

    def get_loss(
        result: str | dict | float,
        loss_value_on_error: float | None = None,
        ignore_errors: bool = False,
    ) -> float | Any:
        if result == "error":
            if ignore_errors:
                return "error"
            elif loss_value_on_error is None:
>               raise ValueError(
                    "An error happened during the execution of your run_pipeline function."
                    " You have three options: 1. If the error is expected and corresponds to"
                    " a loss value in your application (e.g., 0% accuracy), you can set"
                    " loss_value_on_error to some float. 2. If sometimes your pipeline"
                    " crashes randomly, you can set ignore_errors=True. 3. Fix your error."
                )
E               ValueError: An error happened during the execution of your run_pipeline function. You have three options: 1. If the error is expected and corresponds to a loss value in your application (e.g., 0% accuracy), you can set loss_value_on_error to some float. 2. If sometimes your pipeline crashes randomly, you can set ignore_errors=True. 3. Fix your error.

It did show something in the logs, which is nice, but I feel like these errors should get bubbled all the way up.
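
One way to read "bubbled all the way up" is to keep the original exception object instead of the bare string "error", and chain it when the loss is requested. The following is only a minimal sketch under that assumption: `evaluate` and the dict-shaped result are hypothetical, and this `get_loss` is an illustrative variant, not the function in `neps/api.py`.

```python
from __future__ import annotations

import traceback


def evaluate(fn, config: dict) -> dict:
    """Run fn(**config); on failure keep the exception and its traceback text."""
    try:
        return {"loss": fn(**config), "err": None, "tb": None}
    except Exception as err:
        # Keep the exception itself rather than collapsing it to "error".
        return {"loss": None, "err": err, "tb": traceback.format_exc()}


def get_loss(
    result: dict,
    loss_value_on_error: float | None = None,
    ignore_errors: bool = False,
):
    """Hypothetical variant of get_loss that chains the original exception."""
    if result["err"] is None:
        return result["loss"]
    if ignore_errors:
        return "error"
    if loss_value_on_error is not None:
        return loss_value_on_error
    # "from" attaches the user's exception as __cause__, so its traceback
    # is printed together with this one.
    raise ValueError("run_pipeline failed; see the chained exception") from result["err"]
```

With this shape, an uncaught failure prints the user's traceback first, followed by "The above exception was the direct cause of the following exception" and the `ValueError` from `get_loss`.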

eddiebergman commented 3 months ago

This was fixed with #126. You now immediately get the error and traceback from the worker that evaluated the config that crashed. In this example, I raised ValueError("Something went wrong!") inside the target function.

INFO:neps.api:Starting neps.run using root directory results/hyperparameters_example
INFO:neps.api:Running bayesian_optimization as the searcher
INFO:neps.api:Strategy: bayesian_optimization
INFO:neps.runtime:Launching NePS
INFO:neps.runtime:Worker '176609-2024-08-05T16:36:20.632810+00:00' sampled a new trial: Trial(config={'categorical': 1, 'float1': 0.9083763101496742, 'float2': -7.058519989984767, 'integer1': 1, 'integer2': 19}, metadata=MetaData(id='1', location='results/hyperparameters_example/configs/config_1', previous_trial_id=None, previous_trial_location=None, sampling_worker_id='176609-2024-08-05T16:36:20.632810+00:00', time_sampled=1722875780.6338768, evaluating_worker_id=None, evaluation_duration=None, time_submitted=None, time_started=None, time_end=None), state=<State.PENDING: 'pending'>, report=None)
ERROR:neps.state._eval:Error during evaluation of '1': {'categorical': 1, 'float1': 0.9083763101496742, 'float2': -7.058519989984767, 'integer1': 1, 'integer2': 19}.
ERROR:neps.state._eval:Something went wrong!
Traceback (most recent call last):
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/state/_eval.py", line 125, in _eval_trial
    user_result = fn(**kwargs, **trial.config)
  File "/home/skantify/code/neps/neps_examples/basic_usage/hyperparameters.py", line 11, in run_pipeline
    raise ValueError("Something went wrong!")
ValueError: Something went wrong!
INFO:neps.runtime:Worker '176609-2024-08-05T16:36:20.632810+00:00' evaluated trial: 1 as State.CRASHED.
ERROR:neps.runtime:Error during evaluation of '1' : {'categorical': 1, 'float1': 0.9083763101496742, 'float2': -7.058519989984767, 'integer1': 1, 'integer2': 19}.
ERROR:neps.runtime:Something went wrong!
NoneType: None
Traceback (most recent call last):
  File "/home/skantify/code/neps/neps_examples/basic_usage/hyperparameters.py", line 25, in <module>
    neps.run(
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/api.py", line 232, in run
    _launch_runtime(
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/runtime.py", line 534, in _launch_runtime
    worker.run()
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/runtime.py", line 356, in run
    should_stop = self._check_if_should_stop(
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/runtime.py", line 212, in _check_if_should_stop
    raise error_from_this_worker
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/state/_eval.py", line 125, in _eval_trial
    user_result = fn(**kwargs, **trial.config)
  File "/home/skantify/code/neps/neps_examples/basic_usage/hyperparameters.py", line 11, in run_pipeline
    raise ValueError("Something went wrong!")
ValueError: Something went wrong!
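
The final traceback above contains both the worker frames and the frames from `run_pipeline`, which is what Python produces when a stored exception object is raised again later. The sketch below imitates that pattern without using NePS at all; `eval_trial` and `worker_run` are hypothetical stand-ins, and the logger name is made up.

```python
import logging

logger = logging.getLogger("sketch.worker")  # hypothetical logger name


def run_pipeline(**config):
    """Target function that always fails, as in the example above."""
    raise ValueError("Something went wrong!")


def eval_trial(fn, config):
    """Return (result, error); log a failure together with its traceback."""
    try:
        return fn(**config), None
    except Exception as err:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Error during evaluation of %s", config)
        return None, err


def worker_run(fn, configs):
    """Evaluate configs in order and stop at the first failure."""
    results = []
    for config in configs:
        result, err = eval_trial(fn, config)
        if err is not None:
            # Raising the stored exception keeps its original traceback
            # frames and adds the current frame in front of them.
            raise err
        results.append(result)
    return results
```

Calling `worker_run(run_pipeline, [{"float1": 0.5}])` logs the error once and then raises the same `ValueError`, with the `run_pipeline` frame still at the bottom of the traceback.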

If you have another worker that is set to stop on any error occurring, you will also see the error there, for example:

INFO:neps.api:Starting neps.run using root directory results/hyperparameters_example
INFO:neps.api:Running bayesian_optimization as the searcher
INFO:neps.api:Strategy: bayesian_optimization
INFO:neps.runtime:Launching NePS
Traceback (most recent call last):
  File "/home/skantify/code/neps/neps_examples/basic_usage/hyperparameters.py", line 25, in <module>
    neps.run(
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/api.py", line 232, in run
    _launch_runtime(
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/runtime.py", line 534, in _launch_runtime
    worker.run()
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/runtime.py", line 356, in run
    should_stop = self._check_if_should_stop(
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/runtime.py", line 269, in _check_if_should_stop
    raise err
neps.state.err_dump.SerializedError: An error occurred during the evaluation of a trial '1' which was evaluted by worker '176609-2024-08-05T16:36:20.632810+00:00'. The original error could not be deserialized but had the following information:
ValueError: Something went wrong!

Traceback (most recent call last):
  File "/home/skantify/code/wandb-neps/vendored/neps/neps/state/_eval.py", line 125, in _eval_trial
    user_result = fn(**kwargs, **trial.config)
  File "/home/skantify/code/neps/neps_examples/basic_usage/hyperparameters.py", line 11, in run_pipeline
    raise ValueError("Something went wrong!")
ValueError: Something went wrong!
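
The message above says the original error could not be deserialized, yet its type, message and traceback are still shown. One way to get that behaviour is to store those three pieces as text and rebuild a wrapper exception from them in the other worker. This is a sketch under that assumption only: the `SerializedError` class below is a local stand-in that borrows the name from the log, and `dump_error` and `load_error` are hypothetical helpers, not the contents of `neps.state.err_dump`.

```python
import json
import traceback


class SerializedError(Exception):
    """Local stand-in for an error rebuilt from text; not the library's class."""


def dump_error(err: BaseException) -> str:
    """Serialize the type name, message and formatted traceback to JSON."""
    tb_text = "".join(
        traceback.format_exception(type(err), err, err.__traceback__)
    )
    return json.dumps(
        {"type": type(err).__name__, "message": str(err), "traceback": tb_text}
    )


def load_error(payload: str, trial_id: str, worker_id: str) -> SerializedError:
    """Rebuild a readable error in a process that cannot unpickle the original."""
    info = json.loads(payload)
    return SerializedError(
        f"An error occurred during the evaluation of trial '{trial_id}', "
        f"which was evaluated by worker '{worker_id}'.\n"
        f"{info['type']}: {info['message']}\n\n{info['traceback']}"
    )
```

Storing plain text avoids any dependency on the failing worker's classes being importable where the error is read back.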

@Neeratyoy this is what I meant by errors. I will make a small PR to give the user some information on how to recover from this:

  1. If you had workers set to stop on any error (useful when debugging), delete the results and run again.
  2. If you set the worker to stop only if the crash occurred in its own workload, then the solution is just to spawn a new worker.
  3. If you're not debugging and you just want to move on to the next configuration, you should set workers to ignore errors.
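
The three cases in the list above can be summarised as a per-worker stopping policy. The sketch below is purely illustrative: the `OnError` enum, its member names and `should_stop` are hypothetical and are not NePS settings or functions.

```python
from enum import Enum


class OnError(Enum):
    """Hypothetical per-worker policies matching the three cases above."""

    STOP_ON_ANY = "stop_on_any"  # case 1: stop if any worker saw an error
    STOP_ON_OWN = "stop_on_own"  # case 2: stop only on this worker's own crash
    IGNORE = "ignore"            # case 3: move on to the next configuration


def should_stop(policy: OnError, own_error: bool, other_error: bool) -> bool:
    """Decide whether this worker stops, given where an error occurred."""
    if policy is OnError.STOP_ON_ANY:
        return own_error or other_error
    if policy is OnError.STOP_ON_OWN:
        return own_error
    return False
```

Under `STOP_ON_OWN`, a crash in another worker's trial leaves this worker running, which is why spawning a replacement worker is enough in case 2.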

eddiebergman commented 3 months ago

You can now see the improved output in #128.