[Open] ElizavetaSedova opened this issue 1 year ago
❓ Questions

I am new to Dora. I see that I can run distributed training, but is it possible to deploy training across multiple machines? I don't see a way to set master_addr, master_port, or rank. Maybe this hasn't been implemented yet, or perhaps I just missed it, but it would be very cool to have this possibility! I would be very grateful for any help and tips on this!

---

It is possible with a Slurm cluster, but not manually. It should be possible to add, but that would take a few changes in the code.

---

I've updated the instructions for compatibility with torchrun for multi-node training: https://github.com/facebookresearch/dora/blob/main/README.md#multi-node-training-without-slurm
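For reference, here is a minimal sketch of what a manual two-node torchrun launch looks like. The script name `train.py`, the master address, and the node/GPU counts are placeholder assumptions, not from the Dora README; see the linked section for the exact Dora invocation. torchrun itself exposes the master_addr, master_port, and rank settings asked about above, and exports `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` to every worker process.

```bash
# On node 0 (the rendezvous host): 2 nodes, 8 processes (GPUs) per node.
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
    --master_addr=192.168.0.1 --master_port=29500 train.py

# On node 1: same master address and port, different node_rank.
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
    --master_addr=192.168.0.1 --master_port=29500 train.py
```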