NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

ProcessGroupNCCL.cpp:1191 #875

Open ZhengH-git opened 1 year ago

ZhengH-git commented 1 year ago

Details:

Traceback (most recent call last):
  File "/gf3/home/lei/zhenghao/Autoplanner/test/manual_pp/pipeline2x4_ptip.py", line 178, in <module>
    run_stage()
  File "/gf3/home/lei/zhenghao/Autoplanner/test/manual_pp/pipeline2x4_ptip.py", line 132, in run_stage
    output, activations = stage_model.pure_forward(inputs)
  File "/gf3/home/lei/zhenghao/Autoplanner/test/manual_pp/manual_transformer_2_8_4.py", line 240, in pure_forward
    AllGather1_output = self.AllGather1.forward(Reshape3_output)
  File "/gf3/home/lei/zhenghao/Autoplanner/autoplanner/algorithms/myalgorithm/EdgeCostModel/communication.py", line 257, in forward
    return AllGatherFunc.forward(None, tensor, self.tensor_comm_dim, self.process_group)
  File "/gf3/home/lei/zhenghao/Autoplanner/autoplanner/algorithms/myalgorithm/comm_op/comm_op.py", line 53, in forward
    dist.all_gather(buffer_list, inputs, group=group)
  File "/gf3/home/lei/anaconda3/envs/zh/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2070, in all_gather
    work = group.allgather([tensor_list], [tensor])

RuntimeError: NCCL error in: /home/builder/mc3/envs/pytorch-build/envs/pytorch-build/conda-bld/pytorch_1673601922403/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3

ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.

sjeaugey commented 1 year ago

Please run again with NCCL_DEBUG=WARN set in the environment, so that NCCL prints an explicit message as to why it failed.
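A minimal sketch of one way to set this from Python, assuming the variable is set before the process group (and therefore before any NCCL call) is created; exporting it in the launching shell works just as well:

```python
import os
import torch.distributed as dist

# NCCL reads NCCL_DEBUG when its debug subsystem is first initialized, so
# setting it before init_process_group() / the first collective is enough.
os.environ["NCCL_DEBUG"] = "WARN"   # or "INFO" for full initialization logs

dist.init_process_group(backend="nccl")  # rank/world size come from the launcher's environment
```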

ZhengH-git commented 1 year ago

> Please run again with NCCL_DEBUG=WARN set in the environment, so that NCCL prints an explicit message as to why it failed.

Thank you for your advice. The reason is:

gn11:10467:10554 [3] include/socket.h:409 NCCL WARN Net : Connect to 192.180.16.30<33917> failed : Connection refused

sjeaugey commented 1 year ago

You should run again with NCCL_DEBUG=INFO, see which interface NCCL uses for OOB (out-of-band socket communication for bootstrap), and set NCCL_SOCKET_IFNAME if necessary so that NCCL uses an interface over which it can freely communicate between tasks. Alternatively, if the interface NCCL is using is the one you want it to use, you may need to adjust the firewall to allow NCCL to open ports and have other ranks connect to them. Note that NCCL communication is not encrypted, so if you run on a system exposed to the internet, you will want to create a secure VLAN just for NCCL communication that is not exposed outside your job.
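A minimal sketch of pinning the socket interface, assuming the variables are set before init_process_group(); the interface name eth0 is only a placeholder, use one on which all nodes can reach each other:

```python
import os
import torch.distributed as dist

os.environ["NCCL_DEBUG"] = "INFO"          # shows which NIC is picked for "Bootstrap :" / OOB
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # placeholder: an interface reachable from every node

dist.init_process_group(backend="nccl")
```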

ZhengH-git commented 1 year ago

> You should run again with NCCL_DEBUG=INFO, see which interface NCCL uses for OOB (out-of-band socket communication for bootstrap), and set NCCL_SOCKET_IFNAME if necessary so that NCCL uses an interface over which it can freely communicate between tasks. Alternatively, if the interface NCCL is using is the one you want it to use, you may need to adjust the firewall to allow NCCL to open ports and have other ranks connect to them. Note that NCCL communication is not encrypted, so if you run on a system exposed to the internet, you will want to create a secure VLAN just for NCCL communication that is not exposed outside your job.


Thank you for your advice. I've tried many approaches recently but still haven't been able to fix this bug. I ran the experiments on two different clusters, and both hit this problem.

On the first cluster, the machines communicate with each other over the InfiniBand NIC, and the environment variables are set as follows: NCCL_IB_DISABLE=0; NCCL_P2P_DISABLE=0; NCCL_SOCKET_IFNAME=ib0 ('ib0' is taken from ifconfig)

On the second cluster, the machines do not communicate over an InfiniBand NIC, and the environment variables are set as follows: NCCL_IB_DISABLE=1; NCCL_P2P_DISABLE=1; NCCL_SOCKET_IFNAME=enp194s0f0 (taken from ifconfig). I checked and no firewall is installed on the second cluster, so the problem should be unrelated to firewall settings.

I may not have described our experiment clearly before. We use two 4-GPU servers for parallel experiments that include pipeline parallelism: pipeline-parallel communication runs between the machines, and tensor-parallel communication runs inside each machine. However, on the second cluster, when we use PyTorch's dist.all_gather() to run an AllGather within the groups [4, 5] and [6, 7], a "connection refused" error is reported. From the NCCL log we found that NCCL tries to open a socket connection to the other node while initializing the [4, 5] and [6, 7] communication groups, and that connection is refused. In summary, there are two main questions.

  1. Both [4, 5] and [6, 7] are intra-node communication groups, so why do they need to open a socket connection to the other node during initialization?
  2. Why does "connection refused" appear?
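For context, here is a minimal sketch of the kind of group setup described above, with illustrative ranks and sizes; one relevant detail is that torch.distributed.new_group() is collective over the whole job, so every rank must call it for every subgroup, even the ones it does not belong to:

```python
import torch
import torch.distributed as dist

# Illustrative 2-node x 4-GPU layout: ranks 0-3 on node 0, ranks 4-7 on node 1.
dist.init_process_group(backend="nccl")        # 8 ranks, launched externally
rank = dist.get_rank()
torch.cuda.set_device(rank % 4)                # assumes 4 GPUs per node, node-major rank order

# new_group() must be called by *every* rank for *every* group (e.g. rank 0
# still calls new_group([4, 5])); the NCCL communicator for a group is then
# typically created lazily, at the first collective issued on that group.
tp_groups = [dist.new_group(r) for r in ([0, 1], [2, 3], [4, 5], [6, 7])]
my_group = tp_groups[rank // 2]

x = torch.ones(4, device="cuda")
out = [torch.empty_like(x) for _ in range(2)]
dist.all_gather(out, x, group=my_group)        # intra-node all-gather within the pair
```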
sjeaugey commented 1 year ago

If you provide the full log with NCCL_DEBUG=INFO maybe I can try to spot what's wrong.

> why do they need to open a socket connection to the other node during initialization?

To keep all cases the same. All bootstrap is done through sockets, because using shared memory for the intra-node case would require re-implementing the bootstrap completely and would cause more bugs.

Why does "connection refused" appear?

Because NCCL tries to connect to another rank and fails. That means it is trying to use a NIC which cannot reach the remote NIC. Why that happens I don't know yet, but with the log I will hopefully know more.
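If it helps to narrow down the "connection refused", a hypothetical standalone check (independent of NCCL) is to verify that a plain TCP connection works between the two nodes over the IP of the interface in question; the script name, addresses, and port below are placeholders:

```python
# tcp_check.py -- hypothetical helper, not part of NCCL or PyTorch.
#   node A: python tcp_check.py server 0.0.0.0 12345
#   node B: python tcp_check.py client <nodeA-ip-on-that-NIC> 12345
# A ConnectionRefusedError here means the same thing it means for NCCL's
# bootstrap sockets: the TCP connection was actively rejected (wrong NIC/IP,
# nothing listening on that port, or a firewall/filter on the path).
import socket
import sys

mode, host, port = sys.argv[1], sys.argv[2], int(sys.argv[3])

if mode == "server":
    with socket.create_server((host, port)) as srv:
        conn, peer = srv.accept()
        print("accepted connection from", peer)
        conn.close()
else:
    with socket.create_connection((host, port), timeout=5):
        print("connected to", (host, port))
```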

Hoteryoung commented 9 months ago

Hi, I got the same error after using Slurm to train a model in a multi-node, multi-GPU setting.

----------- log after setting NCCL_DEBUG=INFO ---------------------

xxx-xxx-xxx-xxx:172719:172719 [0] NCCL INFO Bootstrap : Using bond0:10.140.28.100<0>
xxx-xxx-xxx-xxx:172719:172719 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
xxx-xxx-xxx-xxx:172719:172719 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB bond0:10.140.28.100<0>
xxx-xxx-xxx-xxx:172719:172719 [0] NCCL INFO Using network IB
NCCL version 2.10.3+cuda11.3
xxx-xxx-xxx-xxx:162667:162667 [0] NCCL INFO Bootstrap : Using bond0:10.140.28.121<0>
xxx-xxx-xxx-xxx:162667:162667 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
xxx-xxx-xxx-xxx:162667:162667 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB bond0:10.140.28.121<0>
xxx-xxx-xxx-xxx:162667:162667 [0] NCCL INFO Using network IB
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO Channel 00/02 : 0 1
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO Channel 01/02 : 0 1
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO Channel 00 : 0[26000] -> 1[26000] [receive] via NET/IB/0/GDRDMA
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO Channel 01 : 0[26000] -> 1[26000] [receive] via NET/IB/0/GDRDMA
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO Channel 00 : 1[26000] -> 0[26000] [receive] via NET/IB/0/GDRDMA
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO Channel 00 : 1[26000] -> 0[26000] [send] via NET/IB/0/GDRDMA
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO Channel 01 : 1[26000] -> 0[26000] [receive] via NET/IB/0/GDRDMA
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO Channel 01 : 1[26000] -> 0[26000] [send] via NET/IB/0/GDRDMA
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO Channel 00 : 0[26000] -> 1[26000] [send] via NET/IB/0/GDRDMA
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO Channel 01 : 0[26000] -> 1[26000] [send] via NET/IB/0/GDRDMA

xxx-xxx-xxx-xxx:172719:172823 [0] misc/ibvwrap.cc:268 NCCL WARN Call to ibv_create_cq failed
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO transport/net_ib.cc:358 -> 2
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO transport/net_ib.cc:457 -> 2
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO include/net.h:21 -> 2
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO transport/net.cc:210 -> 2
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO transport.cc:111 -> 2
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO init.cc:778 -> 2
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO init.cc:904 -> 2
xxx-xxx-xxx-xxx:172719:172823 [0] NCCL INFO group.cc:72 -> 2 [Async thread]

xxx-xxx-xxx-xxx:162667:162769 [0] misc/ibvwrap.cc:268 NCCL WARN Call to ibv_create_cq failed
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO transport/net_ib.cc:358 -> 2
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO transport/net_ib.cc:457 -> 2
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO include/net.h:21 -> 2
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO transport/net.cc:210 -> 2
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO transport.cc:111 -> 2
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO init.cc:778 -> 2
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO init.cc:904 -> 2
xxx-xxx-xxx-xxx:162667:162769 [0] NCCL INFO group.cc:72 -> 2 [Async thread]

Traceback (most recent call last):
  File "main_teacher.py", line 313, in <module>
    init_dist_slurm(backend="nccl", port=port, init_backend="torch")
  File "main_teacher.py", line 281, in init_dist_slurm
    dist.barrier()
  File "/yyy/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.

Traceback (most recent call last):
  File "main_teacher.py", line 313, in <module>
    init_dist_slurm(backend="nccl", port=port, init_backend="torch")
  File "main_teacher.py", line 281, in init_dist_slurm
    dist.barrier()
  File "yyy/anaconda3/envs/SimMIM/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484810403/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
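For anyone trying to reproduce this, here is a rough sketch of what a Slurm init helper like the init_dist_slurm in the traceback above usually looks like; this is an assumption, not the reporter's actual code, and the barrier at the end is the first NCCL collective, which is where the error surfaces:

```python
import os
import subprocess
import torch
import torch.distributed as dist

def init_dist_slurm(backend="nccl", port=29500):
    """Hypothetical sketch of a Slurm-based init helper (not the reporter's code)."""
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])
    # Use the first node of the allocation as the rendezvous address.
    node_list = os.environ["SLURM_NODELIST"]
    master = subprocess.getoutput(f"scontrol show hostnames {node_list} | head -n 1")
    os.environ["MASTER_ADDR"] = master
    os.environ["MASTER_PORT"] = str(port)
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
    dist.barrier()  # first NCCL collective; NCCL sets up its transports (IB here) at this point
```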

taekyounghan commented 3 months ago

> You should run again with NCCL_DEBUG=INFO, see which interface NCCL uses for OOB (out-of-band socket communication for bootstrap), and set NCCL_SOCKET_IFNAME if necessary so that NCCL uses an interface over which it can freely communicate between tasks. Alternatively, if the interface NCCL is using is the one you want it to use, you may need to adjust the firewall to allow NCCL to open ports and have other ranks connect to them. Note that NCCL communication is not encrypted, so if you run on a system exposed to the internet, you will want to create a secure VLAN just for NCCL communication that is not exposed outside your job.
>
> Thank you for your advice. I've tried many approaches recently but still haven't been able to fix this bug. I ran the experiments on two different clusters, and both hit this problem.
>
> On the first cluster, the machines communicate with each other over the InfiniBand NIC, and the environment variables are set as follows:
>
> NCCL_IB_DISABLE=0; NCCL_P2P_DISABLE=0; NCCL_SOCKET_IFNAME=ib0 ('ib0' is taken from ifconfig)
>
> On the second cluster, the machines do not communicate over an InfiniBand NIC, and the environment variables are set as follows:
>
> NCCL_IB_DISABLE=1; NCCL_P2P_DISABLE=1; NCCL_SOCKET_IFNAME=enp194s0f0 (taken from ifconfig)
>
> I checked and no firewall is installed on the second cluster, so the problem should be unrelated to firewall settings. I may not have described our experiment clearly before. We use two 4-GPU servers for parallel experiments that include pipeline parallelism: pipeline-parallel communication runs between the machines, and tensor-parallel communication runs inside each machine. However, on the second cluster, when we use PyTorch's dist.all_gather() to run an AllGather within the groups [4, 5] and [6, 7], a "connection refused" error is reported. From the NCCL log we found that NCCL tries to open a socket connection to the other node while initializing the [4, 5] and [6, 7] communication groups, and that connection is refused. In summary, there are two main questions.
>
> 1. Both [4, 5] and [6, 7] are intra-node communication groups, so why do they need to open a socket connection to the other node during initialization?
> 2. Why does "connection refused" appear?

Hello, @ZhengH-git

Can you share your inter-node pipeline-parallel + intra-node tensor-parallel code?

It is hard to find inter-node pipeline-parallel example code, by the way...

Best regards, Taekyoung