facebookresearch / dora

Dora is an experiment management framework. It expresses grid searches as pure python files as part of your repo. It identifies experiments with a unique hash signature. Scale up to hundreds of experiments without losing your sanity.
MIT License
262 stars 24 forks source link

How to run with torchrun? #52

Open StrongChris opened 11 months ago

StrongChris commented 11 months ago

❓ Questions

Torchrun is the standard recommended way to run multi gpu, multi machine training. How can one launch projects that are written to use dora using torchrun?

adefossez commented 11 months ago

Can you try torchrun -m yourproject.train [ARGS] ? I think it should work out of the box if not I'll have a look. You won't be able to use some of the Dora features like inserting the base config from an existing run with -f [SIG]. If all your machines are on slurm then it will definitely be easier to use the grid system. If training on a single machine juste run dora run -d

adefossez commented 10 months ago

I've checked this actually works, even on multi machines, see: https://github.com/facebookresearch/dora/blob/main/README.md#multi-node-training-without-slurm

MisakaMikoto96 commented 8 months ago

https://github.com/facebookresearch/dora/blob/main/README.md#multi-node-training-without-slurm

hi, how does the [ARGS] here mean?