Open nurlanov-zh opened 1 year ago
It seems like there's no LSF launcher in Hydra yet.
You could either:
Thanks @ashleve for the suggestions.
So far I have managed to launch multi-GPU jobs in parallel using a combination of the Ray launcher and the Optuna sweeper. I configured Ray as in https://github.com/facebookresearch/hydra/issues/1974#issuecomment-1226185827 and Optuna following the example in this template.
It automatically finds allocated GPUs and runs jobs in parallel. However, after one of the jobs is finished, it does not start a new job. What might be the issue?
New jobs wait for all jobs in the previous launch to finish. E.g. if there were 2 parallel runs and one of them quickly failed, the launcher waits for the other job to finish, and only then launches the next set of 2 parallel runs. The desired behavior would be to launch a new job as soon as any previous job finishes.
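To make the difference concrete, here is a small simulation of the two scheduling behaviors described above. It is only a sketch, not Hydra, Ray, or Optuna code: the function names and the job durations are hypothetical, and it just computes the total wall-clock time for a list of job run times on a fixed number of workers (GPUs).

```python
import heapq


def batch_makespan(durations, workers):
    """Total time when every batch of `workers` jobs must fully finish
    before the next batch starts (the behavior observed above)."""
    total = 0.0
    for i in range(0, len(durations), workers):
        # A batch takes as long as its slowest job.
        total += max(durations[i:i + workers])
    return total


def streaming_makespan(durations, workers):
    """Total time when a new job starts as soon as any worker is free
    (the desired behavior)."""
    free_at = [0.0] * workers  # min-heap: time at which each worker frees up
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + d
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Hypothetical run times: short jobs stand in for runs that fail quickly.
durations = [1, 10, 1, 10]
print("batch:", batch_makespan(durations, workers=2))
print("streaming:", streaming_makespan(durations, workers=2))
```

With these made-up durations, the batch schedule leaves one worker idle for most of each batch, which is the idle-GPU effect reported in this thread.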
This might answer your Q https://github.com/facebookresearch/hydra/issues/1377#issuecomment-773583397. In other words, this is a Hydra issue.
Hi,
I have found that the --multirun option with multiple parameters does not work out of the box on an LSF cluster. E.g. after requesting 3 GPUs, a multirun sweep executes the Python runs sequentially, leaving 2 GPUs unused.
I have seen that there is a solution for SLURM clusters. Is there one for LSF clusters with Hydra?