Epistimio / orion

Asynchronous Distributed Hyperparameter Optimization.
https://orion.readthedocs.io

Hyperband rng initialization seems broken #1002

Open legaultmarc opened 2 years ago

legaultmarc commented 2 years ago

When running orion hunt, the first iteration typically works fine, but I get the following traceback on the second iteration:

Traceback (most recent call last):
  File "/home/legaultm/mlenv3/bin/orion", line 8, in <module>
    sys.exit(main())
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/cli/__init__.py", line 36, in main
    return orion_parser.execute(argv)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/cli/base.py", line 110, in execute
    returncode = function(args)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/cli/hunt.py", line 209, in main
    workon(experiment, ignore_code_changes=ignore_code_changes, **worker_config)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/cli/hunt.py", line 163, in workon
    client.workon(
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/client/experiment.py", line 810, in workon
    rval = runner.run()
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/client/runner.py", line 306, in run
    gathered = self.gather()
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/client/runner.py", line 409, in gather
    self.client.observe(trial, result.value)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/client/experiment.py", line 619, in observe
    self._producer.observe(trial)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/worker/producer.py", line 38, in observe
    algorithm.observe([trial])
  File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/worker/experiment.py", line 465, in acquire_algorithm_lock
    locked_algorithm_state.set_state(self.algorithms.state_dict)
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/core/worker/primary_algo.py", line 103, in state_dict
    "algorithm": self.algorithm.state_dict,
  File "/home/legaultm/mlenv3/lib/python3.8/site-packages/orion/algo/hyperband.py", line 285, in state_dict
    "rng_state": self.rng.get_state(),
AttributeError: 'Hyperband' object has no attribute 'rng'

My impression is that self.seed_rng() is never called to initialize the self.rng attribute, which causes this error.

Expected behavior: I don't think this AttributeError should happen.

Steps to reproduce: In my case, this happens after calling orion hunt on a single machine.

Additional context: Here is my Orion config file:

database:
  host: /home/legaultm/.local/share/orion.core/orion/orion_db.pkl
  type: pickleddb

experiment:
  algorithms:
    hyperband:
      seed: 42
      repetitions: 1

evc:
  enable: True

Possible solution: I set the seed in my config, and this seems to fix the problem for the first iteration, but not for subsequent iterations. If I force a call to self.seed_rng() in the Hyperband class __init__, I seem to be able to circumvent the problem. I'm not sure what the right fix is.

bouthilx commented 2 years ago

Hi @legaultmarc, thanks for the detailed bug report! We will look into this asap.

bouthilx commented 2 years ago

I did not manage to reproduce the issue using your config file, but looking at the code I can see that using no seed would cause this issue. Removing the seed from your config causes the issue on my side. Did you run without a seed before? We'll fix the issue when there are no seeds, but I'd like to be sure that there are no other corner cases that we are missing.
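
For readers following along, here is a minimal sketch of the failure mode described above. It is not Orion's actual code: the class names `LazySeedAlgo` and `EagerSeedAlgo` are hypothetical, and the standard library's `random.Random` stands in for whatever generator Hyperband uses internally. The only thing it is meant to show is that an `rng` attribute created only when a seed is given leaves `state_dict` broken for unseeded runs, and that seeding unconditionally in `__init__` avoids this.

```python
import random


class LazySeedAlgo:
    """Hypothetical stand-in: the rng is created only when a seed is given."""

    def __init__(self, seed=None):
        self.seed = seed
        if seed is not None:
            self.seed_rng(seed)  # with seed=None, self.rng is never set

    def seed_rng(self, seed):
        # random.Random(None) seeds itself from system entropy.
        self.rng = random.Random(seed)

    @property
    def state_dict(self):
        # Raises AttributeError if seed_rng() was never called.
        return {"rng_state": self.rng.getstate()}


class EagerSeedAlgo(LazySeedAlgo):
    """Same sketch, but the rng always exists after construction."""

    def __init__(self, seed=None):
        self.seed = seed
        self.seed_rng(seed)  # called unconditionally, even for seed=None


if __name__ == "__main__":
    try:
        LazySeedAlgo(seed=None).state_dict
    except AttributeError as err:
        print("unseeded, lazy:", err)
    print("unseeded, eager:", "rng_state" in EagerSeedAlgo(seed=None).state_dict)
```

Under these assumptions, the seeded case works either way, which would be consistent with the error only showing up when no seed is configured.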

legaultmarc commented 2 years ago

It seems that now even I can't reproduce this bug. I was trying out different algorithms when I first encountered it, so maybe it was due to a weird state in the config/database or some other Python caching? I too now only get it when the seed is set to null in the config. I'll report back here if it happens again...

Thanks for your rapid response :)