NathanYanJing opened this issue 1 year ago
Problem solved for now! In case anyone encounters a similar issue:
If you are using a single node with multiple GPUs, one hacky workaround is to replace DDP with DataParallel:
from torch.nn import DataParallel as DDP
Alternatively, you can try:
torch.multiprocessing.set_start_method('spawn', force=True)
but then you may need to rewrite any lambda functions to avoid pickling errors.
Hi @NathanYanJing. Your torchrun command runs fine for me without any modifications to the code (also using a single-node, multi-GPU training setup). I haven't run across the error you're getting before. Depending on how you're launching the script, you might want to be a little careful with the DDP --> DataParallel change, since that could change the behavior of the parts of train.py that rely on distributed ops (in general, I'm not sure DataParallel plays nicely with torch.distributed).
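For what it's worth, one concrete way that mismatch can show up: DataParallel stays single-process and never calls dist.init_process_group, so any collective op in train.py (an all_reduce of the loss, for example) would fail under it. A minimal sketch:

```python
import torch
import torch.distributed as dist
from torch.nn import DataParallel

# DataParallel wraps the model but stays single-process: it never calls
# dist.init_process_group, so no default process group exists.
model = DataParallel(torch.nn.Linear(4, 4))

print("dist available:  ", dist.is_available())
print("dist initialized:", dist.is_initialized())  # False under DataParallel

# Any distributed collective in the training loop, e.g.
#   dist.all_reduce(loss)
# would raise here, because no process group was ever created.
```

So the DataParallel swap only works if every torch.distributed call in the script is also guarded or removed.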
Hi @wpeebles, thanks for your reply! Yes, I agree that using torch.distributed is the better choice.
Unfortunately, the problem has somehow come back -- it now hangs at the DataLoader step. I am guessing this is an NCCL / NVIDIA driver version issue. Would you mind sharing your NCCL and CUDA versions?
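In case it helps with the comparison, PyTorch exposes the CUDA and NCCL versions it was built against (a quick sketch; setting NCCL_DEBUG=INFO in the environment before launching is also a standard way to see where NCCL stalls):

```python
import torch
import torch.distributed as dist

# Report the versions PyTorch was built against, for comparing setups.
print("torch:", torch.__version__)
print("CUDA :", torch.version.cuda)  # None on CPU-only builds

if dist.is_nccl_available():
    # Returns a version tuple such as (2, 14, 3)
    print("NCCL :", torch.cuda.nccl.version())
```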
Super cool and amazing work!
I am writing to ask for your assistance with an issue I am encountering while training a model using A6000 GPUs. I am using the following command to run my code:
The problem I am experiencing is that training appears to freeze for a long period of time after creating the experiment directory. On occasion, it also throws the following error:
I have not experienced this problem when training with 1, 2, or 3 nodes.
I apologize for my lack of experience in this area, but could you please provide any insights or guidance to help me resolve this issue? Thank you for your assistance.