atong01 / conditional-flow-matching

TorchCFM: a Conditional Flow Matching library
https://arxiv.org/abs/2302.00482
MIT License

Fixed Issue for `torchrun` command for `train_cifar10_ddp.py` #149

Open Xiaoming-Zhao opened 6 days ago

Xiaoming-Zhao commented 6 days ago

What does this PR do?

This PR adds the --standalone command-line argument to the torchrun command. Without it, I cannot get the script train_cifar10_ddp.py to run locally. This also follows the official documentation.
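For illustration, here is a minimal sketch of how the single-node launch command could be assembled. `--standalone` and `--nproc_per_node` are existing torchrun options. The helper name `build_torchrun_cmd` and the GPU count of 2 are assumptions made for this example, and any script-specific arguments are left out because the thread does not list them.

```python
import shlex


def build_torchrun_cmd(script, nproc_per_node, standalone=True, script_args=()):
    """Assemble a torchrun argv list (hypothetical helper, not part of TorchCFM)."""
    cmd = ["torchrun"]
    if standalone:
        # Single-node mode: torchrun sets up a local rendezvous on its own,
        # so no master address or port has to be supplied by hand.
        cmd.append("--standalone")
    cmd.append(f"--nproc_per_node={nproc_per_node}")
    cmd.append(script)
    # Script-specific arguments (assumed; none are given in this thread).
    cmd.extend(script_args)
    return cmd


# Two local GPUs is an assumption made for this example.
print(shlex.join(build_torchrun_cmd("train_cifar10_ddp.py", 2)))
# prints: torchrun --standalone --nproc_per_node=2 train_cifar10_ddp.py
```

Passing `standalone=False` yields the command without the flag, which corresponds to the launch that hung in the setup described here.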

Before submitting

Did you have fun?

Make sure you had fun coding 🙃

ImahnShekhzadeh commented 6 days ago

Thanks for the PR! I'm actually very curious: what error message did you get without the --standalone flag?

Also, did you install your environment recently? I tested train_cifar10_ddp.py with PyTorch 2.0 or 2.1 a few months ago, so if you are running into a problem with a newer version, I'd be happy to try that version and see whether I can reproduce the error.

Xiaoming-Zhao commented 6 days ago

I used PyTorch 2.4.0.

There were no errors but the process just hung forever.

It could possibly be due to my server's setup. Happy to close this PR if that is the case.

ImahnShekhzadeh commented 5 days ago

Hmm, difficult to say... So you tried it locally with two GPUs? I tried it on runpod.io with two GPUs, and there, after specifying the correct master address and port, the script worked without the --standalone flag. In my experience, when a DDP process "hung forever", it was usually because of a wrong master port or address.
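One way to narrow down such a hang is a quick pre-flight check of the rendezvous endpoint on the master node before launching. The sketch below is only an assumption about how such a check might look. It reads the standard `MASTER_ADDR` and `MASTER_PORT` environment variables, falls back to `127.0.0.1` and `29500` (assumed defaults for this example), and tests whether the port can still be bound. The helper names are hypothetical.

```python
import os
import socket


def rendezvous_endpoint(env=None):
    """Read the master address and port from the environment (fallbacks are assumptions)."""
    env = os.environ if env is None else env
    addr = env.get("MASTER_ADDR", "127.0.0.1")
    port = int(env.get("MASTER_PORT", "29500"))
    return addr, port


def can_bind(host, port):
    """Return True if a TCP socket can bind to (host, port) on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Port already in use, or the address does not belong to this host.
            return False
    return True


if __name__ == "__main__":
    host, port = rendezvous_endpoint()
    status = "looks free" if can_bind(host, port) else "cannot be bound"
    print(f"{host}:{port} {status}")
```

A result of "cannot be bound" on the master node suggests that the port is taken or that the address is not local to that machine. A free port does not rule out other causes, such as firewalls between nodes.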