Test trial resumability with PBT & Hyperband

MaximilienLC commented 7 months ago

Hey, thanks for the great package!

I was wondering if you had any update on this issue. Is resuming trials supposedly already possible, just not yet properly tested?

bouthilx commented 7 months ago

Hi! PBT is a bit tricky to use with Hydra because it relies on checkpoints being copied from one trial to another, while Hydra creates a new working directory for each trial and sets it as the working directory for the duration of the trial's execution. It should be possible to use PBT (and Hyperband with checkpointing) if you set your working dir explicitly in the Hydra config.

@Delaunay Is this something you tested yet?

Delaunay commented 7 months ago

As bouthilx pointed out, you need to control the directory so the checkpoint can be found between reruns.

Maybe something like this would work:

hydra:
  sweep:
    dir: multirun/
    subdir: ${hydra.sweeper.experiment.name}/${hydra.sweeper.experiment.paramhash}

This way, all the HPO runs end up in the same folder: one folder per experiment name and, inside it, one folder per trial parameter config. So it should be able to find the checkpoint of a given trial.
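For illustration, here is a minimal sketch of what the training script side could look like once a trial with the same parameters always lands in the same directory. Everything in it is an assumption made for the example: the `checkpoint.json` file name, the `load_state`/`save_state`/`train` helpers, and the toy "training" step are hypothetical and not part of Hydra, Oríon, or the sweeper plugin.

```python
import json
from pathlib import Path

# Hypothetical checkpoint file name, chosen only for this sketch.
CHECKPOINT = "checkpoint.json"


def load_state(workdir: Path) -> dict:
    """Return the saved state if a checkpoint exists, else a fresh state."""
    ckpt = workdir / CHECKPOINT
    if ckpt.exists():
        return json.loads(ckpt.read_text())
    return {"epoch": 0, "value": 0.0}


def save_state(workdir: Path, state: dict) -> None:
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    workdir.mkdir(parents=True, exist_ok=True)
    tmp = workdir / (CHECKPOINT + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(workdir / CHECKPOINT)


def train(workdir: Path, max_epoch: int) -> dict:
    """Resume from the last saved epoch and run up to max_epoch."""
    state = load_state(workdir)
    for epoch in range(state["epoch"], max_epoch):
        # Placeholder for one real epoch of training.
        state = {"epoch": epoch + 1, "value": state["value"] + 1.0}
        save_state(workdir, state)
    return state
```

With Hydra, `workdir` would typically be the trial's run directory (for instance `Path.cwd()` when Hydra changes into it), so a rerun with a higher fidelity picks up where the previous run stopped instead of starting from epoch 0.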

bouthilx commented 7 months ago

Actually, this would work for ASHA/Hyperband but not for PBT. When using PBT, the trial working directory, which corresponds to ${trial.working_dir}, is copied from the parent trial to the current child trial. @Delaunay Do we have support for trial.working_dir in this plugin?

Delaunay commented 7 months ago

In the case of Hydra, shouldn't trial.working_dir be the current working directory that Hydra sets?

bouthilx commented 7 months ago

No, it's determined based on the experiment's working dir: https://github.com/Epistimio/orion/blob/develop/src/orion/core/worker/trial.py#L353

Delaunay commented 7 months ago

It can be easily added https://github.com/Epistimio/hydra_orion_sweeper/pull/35

Delaunay commented 7 months ago

@bouthilx With the latest version, I believe this should work for PBT

hydra:
  sweep:
    dir: multirun/${hydra.sweeper.experiment.name}
    subdir: ${hydra.sweeper.experiment.trial_working_dir}/

MaximilienLC commented 5 months ago

Hey sorry for the late reply, I tried making it work w/ this simple example:

defaults:
  - override hydra/sweeper: orion

hydra:
  sweep:
    dir: multirun/${hydra.sweeper.experiment.name}
    subdir: ${hydra.sweeper.experiment.trial_working_dir}
  sweeper:
    params:
      x: "uniform(-10, 10)"
      epoch: "fidelity(low=1, high=2, base=1)"
    algorithm:
      type: pbt
      config:
        seed: 0
        population_size: 5
        generations: 1
x: 0
epoch: 0

import hydra
from omegaconf import DictConfig

@hydra.main(config_path=".", config_name="config")
def main(cfg: DictConfig) -> float:
    result = (cfg.x * cfg.x) ** cfg.epoch
    with open(f"{cfg.x}+{int(cfg.epoch)}.txt", "w") as f:
        f.write(str(result))
    return result

if __name__ == "__main__":
    main()

However, every trial's trial_working_dir is different (and equal to trial). Example output from hydra.yaml:

...
trial: 4e2287fd0fedb2f7da85735f1599eff5
paramhash: b135edc909ac21e8304df5ca1bd363c5
uuid: 88208312c86311eeb13b0242ac110002
trial_working_dir: 4e2287fd0fedb2f7da85735f1599eff5
...

paramhash looks the same for trials that do not change parameters though.

bouthilx commented 5 months ago

This is expected. What should be happening is that Oríon copies over the dir from the parent trial to the child one, so that if you have a checkpoint there it is available in the child trial directory (happening here https://github.com/Epistimio/orion/blob/develop/src/orion/client/runner.py#L191). Do you see an empty directory instead?

@Delaunay Is the Hydra plugin using Oríon's Runner? If not, then it probably does not call the function prepare_trial_working_dir, which is responsible for this copy from the parent trial dir to the child trial dir.

MaximilienLC commented 5 months ago

Yeah empty with ${hydra.sweeper.experiment.trial_working_dir} but not with ${hydra.sweeper.experiment.paramhash}.

Delaunay commented 5 months ago

No, it does not call the runner, since Hydra has its own launcher that launches the workers.

MaximilienLC commented 5 months ago

Alrighty, do y'all think it can be worked around?

Delaunay commented 5 months ago

We would need to implement the copy for the algo here right before the experiment is launched
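As a rough sketch of what such a copy step could look like (not the plugin's or Oríon's actual code): the helper name `copy_parent_dir` and its arguments are hypothetical, and how the plugin would obtain the parent and child directories is left open. It only uses `shutil.copytree` from the standard library (the `dirs_exist_ok` argument needs Python 3.8+).

```python
import shutil
from pathlib import Path
from typing import Optional


def copy_parent_dir(parent_dir: Optional[Path], child_dir: Path) -> bool:
    """Seed a child trial directory with its parent's files.

    Hypothetical helper for illustration. Returns True if a copy was made.
    """
    # No parent (first generation): just make sure the child dir exists.
    if parent_dir is None or not parent_dir.is_dir():
        child_dir.mkdir(parents=True, exist_ok=True)
        return False

    # Child already has content (e.g. a resumed trial): do not overwrite it.
    if child_dir.exists() and any(child_dir.iterdir()):
        return False

    # Copy checkpoints and any other files from the parent into the child.
    shutil.copytree(parent_dir, child_dir, dirs_exist_ok=True)
    return True
```

The sweeper would have to call something like this for each suggested trial before handing the job to Hydra's launcher, so that the checkpoint is already in place when the trial starts.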

MaximilienLC commented 5 months ago

Got it, based on your responses I'm guessing that's not on the timeline. I'll make a PR to add a warning on the README that PBT-like algorithms aren't functional for now.

Epistimio / hydra_orion_sweeper

Test trial resumability with PBT & Hyperband #20