mrzzmrzz opened this issue 1 year ago
Hi! Thanks for the bug report. I could run multi-node multi-GPU training in the past. Could you provide your PyTorch version?
From my understanding, NCCL is only implemented for GPU tensors. If we try to communicate any CPU tensor through the NCCL group, it will complain -- and that's why we additionally initialize a gloo group.
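To make the split concrete, here is a minimal sketch of that pattern, not TorchDrug's actual comm.py (it assumes LOCAL_RANK is set in the environment, e.g. by torchrun or --use_env): NCCL handles the GPU tensors, while a separate gloo group handles any CPU tensors.

```python
# Minimal sketch of the dual-group pattern described above (not TorchDrug's
# exact comm.py): NCCL for GPU tensors, plus a gloo group for CPU tensors.
import os

import torch
import torch.distributed as dist


def init_groups():
    # The NCCL backend only handles GPU tensors.
    dist.init_process_group(backend="nccl", init_method="env://")
    # Assumes LOCAL_RANK is provided by the launcher (torchrun / --use_env).
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # A second group with the gloo backend is needed for CPU tensors;
    # sending a CPU tensor through the NCCL group would raise an error.
    cpu_group = dist.new_group(backend="gloo")
    return cpu_group


if __name__ == "__main__":
    cpu_group = init_groups()
    gpu_tensor = torch.ones(1, device="cuda")
    cpu_tensor = torch.ones(1)
    dist.all_reduce(gpu_tensor)                   # goes through NCCL
    dist.all_reduce(cpu_tensor, group=cpu_group)  # goes through gloo
```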
Hey, here are my environment variable settings for NCCL:
NCCL_SOCKET_IFNAME=en,eth,em,bond
NCCL_P2P_DISABLE=0
LD_LIBRARY_PATH=:/usr/local/cuda/lib64
NCCL_DEBUG=INFO
My PyTorch and CUDA version is torch 1.8.0+cu111.
After disabling the CPU group, I can run the multi-node multi-GPU training successfully, and the final result seems normal.
I will take a close look.
Note that most multi-node multi-GPU training jobs don't require CPU tensor communication, so they should be fine. The only case in TorchDrug that does is knowledge graph reasoning, since storing the intermediate results on the GPU may overflow GPU memory.
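As an illustration of that knowledge graph reasoning case, here is a hedged sketch of what gathering intermediate results on the CPU might look like; gather_on_cpu and local_scores are made-up names, and cpu_group is assumed to be a gloo group created with dist.new_group(backend="gloo").

```python
import torch
import torch.distributed as dist


def gather_on_cpu(local_scores: torch.Tensor, cpu_group) -> torch.Tensor:
    # Move the (potentially large) intermediate result off the GPU first,
    # so the gathered copies never have to fit in GPU memory all at once.
    local_cpu = local_scores.cpu()
    world_size = dist.get_world_size(group=cpu_group)
    # all_gather assumes the tensor has the same shape on every rank.
    buffers = [torch.zeros_like(local_cpu) for _ in range(world_size)]
    dist.all_gather(buffers, local_cpu, group=cpu_group)  # gloo handles CPU tensors
    return torch.cat(buffers, dim=0)
```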
@Mrz-zz I can successfully run the distributed version of NBFNet, which involves both GPU (NCCL) and CPU (gloo) communications. The commands I used for two nodes are:
```
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=xxx ...
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr=xxx ...
```
where `master_addr` is the alias or IP address of the rank-0 node. I got the alias from `import platform; print(platform.node())`. The two nodes are in the same LAN. My environment is based on PyTorch 1.8.1 and CUDA 11.2, so it's roughly the same as yours.
I am not sure what is going wrong in your case. There is a similar question on the PyTorch forum, which suggests using the `netcat` command to test your network. Maybe you can give it a try?
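If netcat is not at hand, a rough Python equivalent of that connectivity test could look like this; the address and port below are placeholders for your own master_addr and master_port (29500 is the launcher's default).

```python
import socket


def check_master_reachable(master_addr: str, master_port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the rendezvous port succeeds."""
    try:
        with socket.create_connection((master_addr, master_port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"Cannot reach {master_addr}:{master_port}: {exc}")
        return False


if __name__ == "__main__":
    # Run this on the worker node, pointing at the rank-0 node.
    check_master_reachable("MASTER_ADDR_PLACEHOLDER", 29500)
```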
I guess something is wrong with the `GLOO_SOCKET_IFNAME`. Could you share your OS environment variable settings for `GLOO_SOCKET_IFNAME`, `NCCL_P2P_DISABLE` and `NCCL_SOCKET_IFNAME`?
All of them are empty in my environment.
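One thing that might be worth trying (an assumption on my side, not something verified in this thread) is pinning both backends to an explicit network interface before any process group is created, in case the gloo group fails to pick the right one across nodes; "eth0" below is a placeholder for whatever ip addr shows on your nodes.

```python
import os

import torch.distributed as dist

# Placeholders: replace "eth0" with the interface that connects the two nodes.
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

dist.init_process_group(backend="nccl", init_method="env://")
cpu_group = dist.new_group(backend="gloo")
```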
Hey, I found a bug when using Distributed Data Parallel (DDP) training across different nodes.
I use 4 GPUs in total (two GPUs per node, on two nodes at the same time). In this setting I cannot run the code successfully, but training with two GPUs on a single node works fine.
Here is the log:
Then I commented out some code in `comm.py`, and after that I could run the code successfully. It seems that when running on multiple nodes, initialization fails after `init_process_group`, at the point where the CPU dist_group is created by `dist.new_group(backend="gloo")`. I am not sure whether this analysis is right; maybe you can look into this bug more comprehensively. Thank you for your work.
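To narrow this down, a minimal repro sketch along those lines (assuming the same launch commands as above and LOCAL_RANK in the environment) could help: if it fails or stalls inside dist.new_group, the problem is in the gloo rendezvous itself rather than in TorchDrug.

```python
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    print(f"rank {dist.get_rank()}: NCCL group ready", flush=True)
    cpu_group = dist.new_group(backend="gloo")  # the call that reportedly fails on multiple nodes
    print(f"rank {dist.get_rank()}: gloo group ready", flush=True)
    t = torch.ones(1)
    dist.all_reduce(t, group=cpu_group)
    print(f"rank {dist.get_rank()}: gloo all_reduce ok, value={t.item()}", flush=True)


if __name__ == "__main__":
    main()
```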