facebookresearch / dora

Dora is an experiment management framework. It expresses grid searches as pure Python files that live in your repo, and it identifies each experiment with a unique hash signature. Scale up to hundreds of experiments without losing your sanity.
MIT License

Can we train with dora on multiple machines without SLURM? #55

Closed: asifjalal closed this issue 11 months ago

asifjalal commented 1 year ago

Is it possible to use Dora with Horovod or PyTorch DDP? If so, is there any documentation or example code available?

adefossez commented 1 year ago

For single-node training, just use the -d flag. For multi-node training, follow the instructions here: https://github.com/facebookresearch/dora/blob/main/README.md#multi-node-training-without-slurm
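
For readers new to multi-node PyTorch DDP, here is a rough sketch of what each worker process typically needs to know, whichever launcher is used. It is not taken from the Dora README or codebase. It assumes the launcher (for example torchrun) exports the standard RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT environment variables. The `DistEnv` type and `read_dist_env` helper are hypothetical names introduced for illustration, and the single-process fallbacks are illustrative choices.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    """Hypothetical container for the per-process distributed settings."""
    rank: int          # global index of this process across all machines
    local_rank: int    # index of this process on its own machine
    world_size: int    # total number of processes across all machines
    master_addr: str   # address of the machine hosting the rendezvous
    master_port: int   # port used for the rendezvous


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the standard launcher variables, falling back to a
    single-process setup when they are absent (assumed defaults)."""
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", "29500")),
    )


if __name__ == "__main__":
    # Print what this process would see; run it under a launcher to compare.
    print(read_dist_env())
```

In a real training script these values would usually be consumed by `torch.distributed.init_process_group`. The linked README section remains the reference for how Dora itself expects to be launched.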

asifjalal commented 11 months ago

Thanks @adefossez, it worked!