ashleve / lightning-hydra-template

PyTorch Lightning + Hydra. A very user-friendly template for ML experimentation. ⚡🔥⚡

Multiple parallel runs on multi-GPU for hyperparameter search #514

Open nurlanov-zh opened 1 year ago

nurlanov-zh commented 1 year ago

Hi,

I have found that the --multirun option with multiple parameter values does not run jobs in parallel out of the box on an LSF cluster. E.g. requesting 3 GPUs and running:

#BSUB -gpu "num=3"
...
python src/train.py --multirun datamodule.batch_size=125,250,500

results in sequential execution of the Python runs, so 2 of the 3 GPUs stay unused.

I have seen that there is a solution for SLURM clusters. Is there one for LSF clusters + Hydra?
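
For reference, one way to get the three runs above onto three GPUs without Hydra's --multirun is to start one process per value and pin each to a single GPU through CUDA_VISIBLE_DEVICES. The following is only a minimal sketch: the helper names `build_jobs` and `run_all` are hypothetical, the GPU indices 0-2 are assumed to match what LSF allocated, and extra overrides (e.g. selecting a GPU trainer config) may be needed in practice.

```python
import os
import subprocess
import sys
from typing import Dict, List, Tuple

Job = Tuple[Dict[str, str], List[str]]


def build_jobs(batch_sizes: List[int], gpu_ids: List[int]) -> List[Job]:
    """Pair each batch size with its own GPU and build the training command.

    Assumption: there are at least as many GPUs as runs, so nothing is queued.
    """
    if len(batch_sizes) > len(gpu_ids):
        raise ValueError("more runs than GPUs; this sketch does not queue jobs")
    jobs = []
    for batch_size, gpu in zip(batch_sizes, gpu_ids):
        # Each process will only see the one GPU assigned to it.
        env = {"CUDA_VISIBLE_DEVICES": str(gpu)}
        cmd = [sys.executable, "src/train.py", f"datamodule.batch_size={batch_size}"]
        jobs.append((env, cmd))
    return jobs


def run_all(jobs: List[Job]) -> List[int]:
    """Start all runs at once and return their exit codes."""
    procs = [subprocess.Popen(cmd, env={**os.environ, **env}) for env, cmd in jobs]
    return [proc.wait() for proc in procs]


# Usage (inside the LSF job script, from the repository root):
# exit_codes = run_all(build_jobs([125, 250, 500], gpu_ids=[0, 1, 2]))
```

This bypasses Hydra's sweeper entirely, so each run is an independent single run. If the output directory is timestamp-based, runs started in the same second may end up sharing it, in which case a distinct `hydra.run.dir` override per job would be needed.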

ashleve commented 1 year ago

It seems like there's no LSF launcher in hydra yet.

You could either:

  1. Try writing your own hydra launcher. Example: https://github.com/facebookresearch/hydra/tree/main/examples/plugins/example_launcher_plugin
  2. Lightning seems to support launching jobs on LSF, so maybe a workaround could be found where you set the required environment variables yourself and ask Lightning to launch all jobs in subprocesses. Unfortunately, I'm not familiar with LSF, so I can't help you here. https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.plugins.environments.LSFEnvironment.html

nurlanov-zh commented 1 year ago

Thanks @ashleve for the suggestions.

So far I have managed to launch multi-GPU jobs in parallel using a combination of the Ray launcher and the Optuna sweeper. I configured Ray as in https://github.com/facebookresearch/hydra/issues/1974#issuecomment-1226185827 and Optuna following the example in this template.

It automatically finds allocated GPUs and runs jobs in parallel. However, after one of the jobs is finished, it does not start a new job. What might be the issue?

nurlanov-zh commented 1 year ago

New jobs wait for all jobs in the previous launch to finish. E.g. if there were 2 parallel runs and one of them failed quickly, the sweep waits for the other job to finish, and only then is the next set of 2 parallel runs launched. The desired behavior would be to launch a new job as soon as any previous job finishes.
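
To make the difference concrete, here is a small simulation of the two scheduling policies. It does not use Hydra, Ray, or Optuna; it only computes the total wall-clock time for a list of job durations. The function names and the durations are made up for illustration.

```python
import heapq
from typing import List


def makespan_batched(durations: List[int], n_parallel: int) -> int:
    """Observed behavior: launch n_parallel jobs, wait for all, then launch the next batch."""
    total = 0
    for i in range(0, len(durations), n_parallel):
        # A batch takes as long as its slowest job.
        total += max(durations[i:i + n_parallel])
    return total


def makespan_greedy(durations: List[int], n_parallel: int) -> int:
    """Desired behavior: whichever worker frees up first takes the next job."""
    free_at = [0] * n_parallel  # time at which each worker becomes free
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Hypothetical run times in minutes; the first job fails after 1 minute.
durations = [1, 60, 50, 5]
print(makespan_batched(durations, n_parallel=2))  # 110
print(makespan_greedy(durations, n_parallel=2))   # 60
```

In the batched case the GPU freed by the failed job sits idle for 59 minutes, which matches what I see on the cluster.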

tesfaldet commented 1 year ago

This might answer your question: https://github.com/facebookresearch/hydra/issues/1377#issuecomment-773583397 . In other words, this is a limitation in Hydra itself rather than in this template.