Open Delaunay opened 1 year ago
Hi! PBT is a bit tricky to use with Hydra because it relies on checkpoints being copied from one trial to another, while Hydra creates a new working directory for each trial and makes it the current working directory for the duration of that trial. It should be possible to use PBT (and Hyperband with checkpointing) if you set your working dir explicitly in the Hydra config.
@Delaunay Is this something you tested yet?
As Bouthilx pointed out, you need to control the directory so the checkpoint can be found between reruns.
Maybe something like this would work:
hydra:
  sweep:
    dir: multirun/
    subdir: ${hydra.sweeper.experiment.name}/${hydra.sweeper.experiment.paramhash}
So all the HPO runs will end up in the same folder: one folder per experiment name, and inside it one folder per trial parameter config. That way it should be able to find the checkpoint of a given trial.
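For this layout to help, the training script itself has to save and reload a checkpoint from the trial's directory. A minimal sketch of that pattern, assuming a JSON checkpoint; the file name `checkpoint.json` and the helpers `load_checkpoint`, `save_checkpoint` and `train` are hypothetical names for illustration, not part of Hydra or Oríon:

```python
import json
from pathlib import Path

CHECKPOINT_NAME = "checkpoint.json"  # hypothetical file name


def load_checkpoint(workdir: Path) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    path = workdir / CHECKPOINT_NAME
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0, "value": 0.0}


def save_checkpoint(workdir: Path, state: dict) -> None:
    """Write to a temporary file first so a crash never leaves a partial checkpoint."""
    workdir.mkdir(parents=True, exist_ok=True)
    tmp = workdir / (CHECKPOINT_NAME + ".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(workdir / CHECKPOINT_NAME)


def train(workdir: Path, x: float, epochs: int) -> dict:
    """Toy training loop that resumes from the last completed epoch."""
    state = load_checkpoint(workdir)
    for epoch in range(state["epoch"], epochs):
        state = {"epoch": epoch + 1, "value": state["value"] + x * x}
        save_checkpoint(workdir, state)
    return state
```

Calling `train(workdir, x, epochs)` a second time with a higher `epochs` only runs the missing epochs, as long as `workdir` is the same directory as before. That is the property the `subdir` setting is meant to provide when a trial is rerun with the same parameters at a higher fidelity.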
Actually, this would work for ASHA/Hyperband but not for PBT. When using PBT, the trial working directory, which corresponds to ${trial.working_dir}, is copied from the parent trial to the current child trial. @Delaunay Do we have support for trial.working_dir in this plugin?
In the case of Hydra, shouldn't trial.working_dir be the current working directory that Hydra sets?
No, it's determined based on the experiment's working dir: https://github.com/Epistimio/orion/blob/develop/src/orion/core/worker/trial.py#L353
It can be easily added https://github.com/Epistimio/hydra_orion_sweeper/pull/35
@bouthilx With the latest version, I believe this should work for PBT
hydra:
  sweep:
    dir: multirun/${hydra.sweeper.experiment.name}
    subdir: ${hydra.sweeper.experiment.trial_working_dir}/
Hey sorry for the late reply, I tried making it work w/ this simple example:
defaults:
  - override hydra/sweeper: orion

hydra:
  sweep:
    dir: multirun/${hydra.sweeper.experiment.name}
    subdir: ${hydra.sweeper.experiment.trial_working_dir}
  sweeper:
    params:
      x: "uniform(-10, 10)"
      epoch: "fidelity(low=1, high=2, base=1)"
    algorithm:
      type: pbt
      config:
        seed: 0
        population_size: 5
        generations: 1

x: 0
epoch: 0
import hydra
from omegaconf import DictConfig


@hydra.main(config_path=".", config_name="config")
def main(cfg: DictConfig) -> float:
    result = (cfg.x * cfg.x) ** cfg.epoch
    with open(f"{cfg.x}+{int(cfg.epoch)}.txt", "w") as f:
        f.write(str(result))
    return result


if __name__ == "__main__":
    main()
However, every trial's trial_working_dir is different (and equal to trial).
Example output hydra.yaml
...
trial: 4e2287fd0fedb2f7da85735f1599eff5
paramhash: b135edc909ac21e8304df5ca1bd363c5
uuid: 88208312c86311eeb13b0242ac110002
trial_working_dir: 4e2287fd0fedb2f7da85735f1599eff5
...
paramhash looks the same for trials that do not change parameters though.
This is expected. What should be happening is that Oríon copies over the dir from the parent trial to the child one, so that if you have a checkpoint there it is available in the child trial directory (happening here https://github.com/Epistimio/orion/blob/develop/src/orion/client/runner.py#L191). Do you see an empty directory instead?
@Delaunay Is the hydra plugin using orion's Runner? If not, then it probably does not call the function prepare_trial_working_dir, which is responsible for this copy from the parent trial dir to the child trial dir.
Yeah, empty with ${hydra.sweeper.experiment.trial_working_dir} but not with ${hydra.sweeper.experiment.paramhash}.
No, it does not call the runner, since Hydra has its own launcher that launches the workers.
Alrighty, do y'all think it can be worked around?
We would need to implement the copy for the algo here right before the experiment is launched
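As a rough illustration of what such a copy step could look like, here is a sketch using only the standard library. It is not Oríon's prepare_trial_working_dir and not code from this plugin; the function name `copy_parent_working_dir` is hypothetical, and how the parent and child directories are resolved from the trials is deliberately left out:

```python
import shutil
from pathlib import Path


def copy_parent_working_dir(parent_dir: Path, child_dir: Path) -> bool:
    """Copy everything from the parent trial's directory into the child's.

    Returns True if a copy was made, False if the parent directory does not
    exist (for example, a first-generation trial with no parent).
    """
    if not parent_dir.is_dir():
        return False
    # dirs_exist_ok (Python 3.8+) allows the child directory to exist already,
    # e.g. if the launcher created it before the job started.
    shutil.copytree(parent_dir, child_dir, dirs_exist_ok=True)
    return True
```

The copy would have to run before the child trial's job starts, so that the script finds the parent's checkpoint in its own working directory on startup.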
Got it, based on your responses I'm guessing that's not on the timeline. I'll make a PR to add a warning on the README that PBT-like algorithms aren't functional for now.
Hey, thanks for the great package!
I was wondering if you had any update on this issue. Is resuming trials supposed to be possible currently, with the feature just not properly tested yet?