ashleve / lightning-hydra-template

PyTorch Lightning + Hydra. A very user-friendly template for ML experimentation. ⚡🔥⚡
4.29k stars, 656 forks

Resume multi-run #535

Open nurlanov-zh opened 1 year ago

nurlanov-zh commented 1 year ago

Hi,

Is it possible to resume a multi-run? E.g. if the Optuna hyperparameter search has crashed, can we resume the search from that point without having to sample new runs?

ashleve commented 1 year ago

Not possible as far as I'm aware.

I think it's best to write a dedicated task / pipeline for hyperparameter search if you want to be able to resume.
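For illustration, here is a minimal sketch of what such a dedicated, resumable search could look like. It is not part of this template and does not use Hydra's sweeper; the `objective`, `sample_params`, and `run_search` names, the search space, and the JSON-lines ledger file are all assumptions made up for the example. The idea is simply to record each finished trial on disk and skip recorded trials on restart. With Optuna specifically, passing a persistent `storage` URL and `load_if_exists=True` to `optuna.create_study` serves the same purpose.

```python
import json
import random
from pathlib import Path


def objective(params):
    # Hypothetical stand-in for a full training run; returns a value to minimize.
    return (params["lr"] - 0.01) ** 2 + 0.001 * params["batch_size"]


def sample_params(trial, seed=0):
    # Seed by trial index so a resumed search proposes the same candidates
    # it would have proposed without the crash.
    rng = random.Random(seed * 100003 + trial)
    return {"lr": 10 ** rng.uniform(-4, -1), "batch_size": rng.choice([32, 64, 128])}


def load_finished(ledger):
    # Read every trial recorded by earlier (possibly crashed) invocations.
    if not ledger.exists():
        return {}
    finished = {}
    for line in ledger.read_text().splitlines():
        if line.strip():
            record = json.loads(line)
            finished[record["trial"]] = record
    return finished


def run_search(ledger, n_trials, seed=0):
    finished = load_finished(ledger)
    for trial in range(n_trials):
        if trial in finished:
            continue  # already completed before the crash, do not re-run
        params = sample_params(trial, seed)
        record = {"trial": trial, "params": params, "value": objective(params)}
        # Append right after the trial finishes so a crash loses at most one trial.
        with ledger.open("a") as f:
            f.write(json.dumps(record) + "\n")
        finished[trial] = record
    return min(finished.values(), key=lambda r: r["value"])


if __name__ == "__main__":
    best = run_search(Path("trials.jsonl"), n_trials=20)
    print(best)
```

Running the script again after an interruption picks up at the first trial missing from `trials.jsonl`. A sampler that adapts to past results (as Optuna's default does) would also need its history restored, which is what a persistent study storage handles.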

tesfaldet commented 1 year ago

As someone who has implemented Hydra-aware resuming from pre-emption on multi-runs for hyperparameter search, with both wandb's sweeper and another sweeper made by some colleagues, I 100% agree with the suggestion of writing a dedicated pipeline for it. Each sweeper (and its Hydra plugin) operates quite differently and handles resuming very differently. It would be infeasible for this template to cover the use case for every Hydra-supported sweeper. You would need to integrate this functionality both in this template and in the Hydra sweeper, i.e., within the sweeper plugin code. Take a look at this gross PR I made for getting it to work for wandb (among other features). It gets messy real fast.

nurlanov-zh commented 1 year ago

@tesfaldet thanks for the links! I will take a look