facebookresearch / dora

Dora is an experiment management framework. It expresses grid searches as pure python files as part of your repo. It identifies experiments with a unique hash signature. Scale up to hundreds of experiments without losing your sanity.

Can I train on multiple machines? #45

Open ElizavetaSedova opened 1 year ago

ElizavetaSedova commented 1 year ago

❓ Questions

I am new to Dora. I see that I can run distributed training, but is it possible to run training across multiple machines? I don't see a way to set master_addr, master_port, or rank. Maybe this isn't supported yet, or perhaps I just missed it, but it would be very cool to have this possibility! I would be very grateful for any help and tips on this!
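To clarify what I mean: with plain PyTorch distributed I would set something like this on every machine (just a sketch of the usual pattern, with a made-up address):

```python
import os

import torch.distributed as dist

# Standard PyTorch multi-machine rendezvous: every process needs the
# master address/port plus its own rank and the total world size.
os.environ["MASTER_ADDR"] = "192.168.1.10"  # hypothetical head-node IP
os.environ["MASTER_PORT"] = "29500"
dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),              # different on each process
    world_size=int(os.environ["WORLD_SIZE"]),  # total number of processes
)
```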

adefossez commented 1 year ago

It is possible with a Slurm cluster, but not when launching the processes manually. Manual multi-machine launching could be added, but it would take a few changes in the code.
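For reference, on Slurm a grid file can simply request enough GPUs to span several machines, and Dora takes care of scheduling the nodes and setting up the distributed rendezvous. Roughly along these lines (a sketch from memory of the grid API; check the README for the exact decorator and `slurm_` options, and the GPU count here is hypothetical):

```python
from dora import Explorer, Launcher


@Explorer
def explorer(launcher: Launcher):
    # Hypothetical numbers: with 8 GPUs per node, asking Slurm for 16 GPUs
    # schedules the job across 2 machines.
    launcher.slurm_(gpus=16)
    launcher()  # one experiment in the grid; call again for more
```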

adefossez commented 10 months ago

I've updated the instructions for compatibility with torchrun for multi-node training: https://github.com/facebookresearch/dora/blob/main/README.md#multi-node-training-without-slurm
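For anyone landing here: torchrun passes everything (address, port, ranks) through environment variables, so on the training side the usual pattern is enough (a generic PyTorch sketch, not Dora-specific):

```python
import os

import torch.distributed as dist

# Under torchrun, each process already has RANK, LOCAL_RANK, WORLD_SIZE,
# MASTER_ADDR and MASTER_PORT in its environment, so the default env://
# rendezvous needs no extra arguments.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this machine
```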