Open Xiaoming-Zhao opened 6 days ago
Thanks for the PR! I'm actually very curious: what error message did you get without the --standalone flag?
Also, did you install your environment recently? I tested train_cifar10_ddp.py with PyTorch 2.0 or 2.1 a few months ago, so if you are running into a problem with a newer version, I'd be happy to try that version and see whether I can reproduce the error.
I used PyTorch 2.4.0.
There were no errors, but the process just hung forever.
It could be due to my server's setup. Happy to close this PR if that is the case.
Hmm, difficult to say... So locally, you tried it with two GPUs?
I myself tried it on runpod.io with two GPUs, and there, after specifying the correct master address and port, the script worked without the --standalone flag. In my experience, when a DDP process "hung forever", it was usually because of a wrong master port/address.
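For anyone debugging a similar hang, here is a minimal sketch of a pre-launch check, not something taken from this repository. It assumes the job uses the standard MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE environment variables that torch.distributed reads for env-based initialization. The helper names describe_rendezvous and can_bind are hypothetical.

```python
import os
import socket


def describe_rendezvous(env=None):
    """Collect the rendezvous settings a DDP worker would read from its environment."""
    env = os.environ if env is None else env
    keys = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    # Missing variables show up as None, which makes a misconfiguration easy to spot.
    return {key: env.get(key) for key in keys}


def can_bind(port, host="127.0.0.1"):
    """Return True if `port` can be bound on `host`, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    settings = describe_rendezvous()
    print(settings)
    if settings["MASTER_PORT"] is not None:
        # Run this on the master node *before* launching the training job.
        print("master port free:", can_bind(int(settings["MASTER_PORT"])))
```

If the variables are unset or the port is already taken, that is a plausible cause of the hang. For a single-node run, torchrun --standalone sidesteps this by starting a local rendezvous on a free port.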
What does this PR do?
This PR adds a --standalone command line argument to torchrun, without which I cannot make the script train_cifar10_ddp.py run locally. This also follows the official documentation.
Before submitting
pytest command?
pre-commit run -a command?
Did you have fun?
Make sure you had fun coding 🙃