NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

misc/ibvwrap.cc:187 NCCL WARN Call to ibv_modify_qp failed with error Network is unreachable #185

Open chgdragon2023 opened 8 months ago

chgdragon2023 commented 8 months ago

Hi, after setting up RDMA and the host network, I ran ib_write_bw to confirm that RDMA works properly between the two servers (also with --cuda=xx). I also verified that MPI works properly between the servers. However, when I tried to run nccl-tests with the following command:

/opt/openmpi/bin/mpirun --allow-run-as-root \
    --np 2 \
    --host 10.5.128.35:1,10.5.128.34:1 \
    --mca btl_tcp_if_include bond0 \
    --mca coll_hcoll_enable 0 \
    --mca pml ob1 \
    --mca btl tcp,self \
    -x NCCL_IB_HCA=mlx5_10,mlx5_11 \
    -x NCCL_SOCKET_IFNAME=enp220s0np0 \
    -x NCCL_TESTS_DEVICE=0 \
    -x NCCL_DEBUG_FILE=/root/test/%h.%p.nccl.log \
    -x PATH -x NCCL_ALGO=RING \
    -x NCCL_IB_GID_INDX=3 \
    -x CUDA_VISIBLE_DEVICES=6,7 \
    -x NCCL_P2P_DISABLE=1 \
    -x NCCL_SHM_DISABLE=1 \
    -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH \
    -x NCCL_IB_QPS_PER_CONNECTION=2 \
    ./nccltest/nccl-tests-2.13.0/build/all_reduce_perf -b 2G -e 2G -f 2 -g 1 -t 1 -c 0 -n 1

I always get the following error:

[1702077406.818122] [dgxh100-4:130602:0] ucp_context.c:1774 UCX WARN UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)

# nThread 1 nGpus 1 minBytes 2147483648 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 1 agg iters: 1 validation: 0 graph: 0
#
# Using devices
#  Rank 0 Pid 130602 on dgxh100-4 device 0 [0xd1] NVIDIA H100 80GB HBM3
#  Rank 1 Pid 225140 on dgxh100-3 device 0 [0xd1] NVIDIA H100 80GB HBM3

NCCL version 2.19.3+cuda12.3
dgxh100-3:225140:225140 [0] NCCL INFO cudaDriverVersion 12000
dgxh100-3:225140:225140 [0] NCCL INFO Bootstrap : Using enp220s0np0:11.1.128.128<0>
dgxh100-3:225140:225140 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
dgxh100-3:225140:225140 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
dgxh100-3:225140:225289 [0] NCCL INFO NET/IB : Using [0]mlx5_10:1/RoCE [1]mlx5_11:1/RoCE [RO]; OOB enp220s0np0:11.1.128.128<0>
dgxh100-3:225140:225289 [0] NCCL INFO Using network IB
dgxh100-3:225140:225289 [0] NCCL INFO comm 0x55d878758ea0 rank 1 nranks 2 cudaDev 0 nvmlDev 6 busId d1000 commId 0xba9a3fd7f3b46223 - Init START
dgxh100-3:225140:225289 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
dgxh100-3:225140:225289 [0] NCCL INFO === System : maxBw 48.0 totalBw 360.0 ===
dgxh100-3:225140:225289 [0] NCCL INFO CPU/1 (1/1/2)
dgxh100-3:225140:225289 [0] NCCL INFO + PCI[48.0] - PCI/CC000 (15b3197900000000)
dgxh100-3:225140:225289 [0] NCCL INFO + PCI[48.0] - NIC/CE000
dgxh100-3:225140:225289 [0] NCCL INFO + NET[50.0] - NET/0 (8bcd90003ae6d94/1/50.000000)
dgxh100-3:225140:225289 [0] NCCL INFO + PCI[48.0] - PCI/CF000 (15b3197900000000)
dgxh100-3:225140:225289 [0] NCCL INFO + PCI[48.0] - GPU/D1000 (1)
dgxh100-3:225140:225289 [0] NCCL INFO + NVL[360.0] - NVS/0
dgxh100-3:225140:225289 [0] NCCL INFO + PCI[48.0] - PCI/DA000 (15b3197900000000)
dgxh100-3:225140:225289 [0] NCCL INFO + PCI[48.0] - NIC/DC000
dgxh100-3:225140:225289 [0] NCCL INFO + NET[50.0] - NET/1 (bcd90003ae6d94/1/50.000000)
dgxh100-3:225140:225289 [0] NCCL INFO ==========================================
dgxh100-3:225140:225289 [0] NCCL INFO GPU/D1000 :GPU/D1000 (0/5000.000000/LOC) NVS/0 (1/360.000000/NVL) CPU/1 (3/48.000000/PHB) NET/0 (4/48.000000/PXB) NET/1 (6/48.000000/PHB)
dgxh100-3:225140:225289 [0] NCCL INFO NET/0 :GPU/D1000 (4/48.000000/PXB) CPU/1 (3/48.000000/PHB) NET/0 (0/5000.000000/LOC) NET/1 (6/48.000000/PHB)
dgxh100-3:225140:225289 [0] NCCL INFO NET/1 :GPU/D1000 (6/48.000000/PHB) CPU/1 (3/48.000000/PHB) NET/0 (6/48.000000/PHB) NET/1 (0/5000.000000/LOC)
dgxh100-3:225140:225289 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 2, bw 24.000000/24.000000, type LOC/PXB, sameChannels 1
dgxh100-3:225140:225289 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
dgxh100-3:225140:225289 [0] NCCL INFO 1 : NET/0 GPU/1 NET/0
dgxh100-3:225140:225289 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 2, bw 48.000000/24.000000, type LOC/PXB, sameChannels 1
dgxh100-3:225140:225289 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
dgxh100-3:225140:225289 [0] NCCL INFO 1 : NET/0 GPU/1 NET/0
dgxh100-3:225140:225289 [0] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1
dgxh100-3:225140:225289 [0] NCCL INFO Tree 2 : -1 -> 1 -> 0/-1/-1
dgxh100-3:225140:225289 [0] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1
dgxh100-3:225140:225289 [0] NCCL INFO Tree 3 : -1 -> 1 -> 0/-1/-1
dgxh100-3:225140:225289 [0] NCCL INFO Ring 00 : 0 -> 1 -> 0
dgxh100-3:225140:225289 [0] NCCL INFO Ring 01 : 0 -> 1 -> 0
dgxh100-3:225140:225289 [0] NCCL INFO Ring 02 : 0 -> 1 -> 0
dgxh100-3:225140:225289 [0] NCCL INFO Ring 03 : 0 -> 1 -> 0
dgxh100-3:225140:225289 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1
dgxh100-3:225140:225289 [0] NCCL INFO P2P Chunksize set to 131072
dgxh100-3:225140:225289 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[6] [receive] via NET/IB/0/GDRDMA
dgxh100-3:225140:225289 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[6] [receive] via NET/IB/0/GDRDMA
dgxh100-3:225140:225289 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[6] [receive] via NET/IB/0/GDRDMA
dgxh100-3:225140:225289 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[6] [receive] via NET/IB/0/GDRDMA
dgxh100-3:225140:225289 [0] NCCL INFO Channel 00/0 : 1[6] -> 0[6] [send] via NET/IB/0/GDRDMA
dgxh100-3:225140:225289 [0] NCCL INFO Channel 01/0 : 1[6] -> 0[6] [send] via NET/IB/0/GDRDMA
dgxh100-3:225140:225289 [0] NCCL INFO Channel 02/0 : 1[6] -> 0[6] [send] via NET/IB/0/GDRDMA
dgxh100-3:225140:225289 [0] NCCL INFO Channel 03/0 : 1[6] -> 0[6] [send] via NET/IB/0/GDRDMA

dgxh100-3:225140:225355 [0] misc/ibvwrap.cc:187 NCCL WARN Call to ibv_modify_qp failed with error Network is unreachable
dgxh100-3:225140:225355 [0] NCCL INFO transport/net_ib.cc:579 -> 2
dgxh100-3:225140:225355 [0] NCCL INFO transport/net_ib.cc:786 -> 2
dgxh100-3:225140:225355 [0] NCCL INFO transport/net.cc:728 -> 2
dgxh100-3:225140:225355 [0] NCCL INFO proxy.cc:1306 -> 2
dgxh100-3:225140:225355 [0] NCCL INFO proxy.cc:1377 -> 2

dgxh100-3:225140:225355 [0] proxy.cc:1519 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 2

dgxh100-3:225140:225289 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer dgxh100-3<41109>
dgxh100-3:225140:225289 [0] NCCL INFO misc/socket.cc:749 -> 6

dgxh100-3:225140:225289 [0] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7fa0242685e0
dgxh100-3:225140:225289 [0] NCCL INFO transport/net.cc:362 -> 3
dgxh100-3:225140:225289 [0] NCCL INFO transport.cc:168 -> 3
dgxh100-3:225140:225289 [0] NCCL INFO init.cc:1079 -> 3
dgxh100-3:225140:225289 [0] NCCL INFO init.cc:1358 -> 3
dgxh100-3:225140:225289 [0] NCCL INFO group.cc:65 -> 3 [Async thread]
dgxh100-3:225140:225140 [0] NCCL INFO group.cc:406 -> 3
dgxh100-3:225140:225140 [0] NCCL INFO group.cc:96 -> 3
dgxh100-3: Test NCCL failure common.cu:908 'internal error - please report this issue to the NCCL developers / Socket recv failed while polling for opId=0x7fa0242685e0'
 .. dgxh100-3 pid 225140: Test failure common.cu:806
dgxh100-4: Test NCCL failure common.cu:908 'remote process exited or there was a network error / socketProgressOpt: Call to recv from 11.1.128.128<56179> failed : Connection reset by peer'
 .. dgxh100-4 pid 130602: Test failure common.cu:806

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[13029,1],0] Exit code: 3

AddyLaddy commented 8 months ago

The RoCE GID needs to be set with NCCL_IB_GID_INDEX=3; it looks like you have a typo in your script (NCCL_IB_GID_INDX). You may also need to set NCCL_IB_TC based on your network configuration or CSP recommendations.

I also recommend setting NCCL_IB_SPLIT_DATA_ON_QPS=0 when setting NCCL_IB_QPS_PER_CONNECTION.
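A misspelled variable such as NCCL_IB_GID_INDX is simply never read, so nothing in the run flags it. A minimal sketch of a pre-flight check follows, assuming you keep your own list of variable names you intend to set. The list below contains only the names mentioned in this thread and is not a complete list of NCCL variables; `check_nccl_vars` is a hypothetical helper, not part of NCCL or nccl-tests.

```python
import difflib

# Assumption: only the variable names that appear in this thread.
# This is NOT the full set of variables NCCL or nccl-tests understand.
KNOWN_NCCL_VARS = (
    "NCCL_ALGO",
    "NCCL_DEBUG",
    "NCCL_DEBUG_FILE",
    "NCCL_DEBUG_SUBSYS",
    "NCCL_IB_GID_INDEX",
    "NCCL_IB_HCA",
    "NCCL_IB_QPS_PER_CONNECTION",
    "NCCL_IB_SPLIT_DATA_ON_QPS",
    "NCCL_IB_TC",
    "NCCL_P2P_DISABLE",
    "NCCL_SHM_DISABLE",
    "NCCL_SOCKET_IFNAME",
    "NCCL_TESTS_DEVICE",
)


def check_nccl_vars(names, known=KNOWN_NCCL_VARS):
    """Map each unrecognized NCCL_* name to its closest known name (or None)."""
    suggestions = {}
    for name in names:
        # Ignore non-NCCL variables and names that are already on the list.
        if not name.startswith("NCCL_") or name in known:
            continue
        close = difflib.get_close_matches(name, known, n=1, cutoff=0.8)
        suggestions[name] = close[0] if close else None
    return suggestions


if __name__ == "__main__":
    # Names taken from the first mpirun command in this thread.
    for bad, hint in check_nccl_vars(["NCCL_IB_GID_INDX", "NCCL_DEBUG", "PATH"]).items():
        print(f"unrecognized: {bad} (did you mean {hint}?)")
```

A name reported with no suggestion is not necessarily wrong; it may just be missing from the hand-maintained list.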

chgdragon2023 commented 8 months ago

Thanks for helping. I fixed the typo in the command and added NCCL_IB_SPLIT_DATA_ON_QPS=0, but I still get an error:

/opt/openmpi/bin/mpirun --allow-run-as-root \
    --np 2 \
    --host 10.5.128.35:1,10.5.128.34:1 \
    --mca btl_tcp_if_include bond0 \
    --mca oob_tcp_if_include bond0 \
    --mca coll_hcoll_enable 0 \
    --mca pml ob1 \
    --mca btl tcp,self \
    -x NCCL_IB_HCA=mlx5_10,mlx5_11 \
    -x NCCL_SOCKET_IFNAME=enp220s0np0 \
    -x NCCL_TESTS_DEVICE=0 \
    -x PATH -x NCCL_ALGO=RING \
    -x NCCL_IB_GID_INDEX=3 \
    -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
    -x CUDA_VISIBLE_DEVICES=6,7 \
    -x NCCL_P2P_DISABLE=1 \
    -x NCCL_SHM_DISABLE=1 \
    -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH \
    -x NCCL_IB_QPS_PER_CONNECTION=2 \
    ./nccltest/nccl-tests-2.13.0/build/all_reduce_perf -b 2G -e 2G -f 2 -g 1 -t 1 -c 0 -n 1

[1702106665.058985] [dgxh100-4:460770:0] ucp_context.c:1774 UCX WARN UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)

# nThread 1 nGpus 1 minBytes 2147483648 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 1 agg iters: 1 validation: 0 graph: 0
#
# Using devices
#  Rank 0 Pid 460770 on dgxh100-4 device 0 [0xdf] NVIDIA H100 80GB HBM3
#  Rank 1 Pid 2152176 on dgxh100-3 device 0 [0xd1] NVIDIA H100 80GB HBM3

dgxh100-4:460770:460770 [0] NCCL INFO Bootstrap : Using enp220s0np0:11.1.128.136<0>
dgxh100-4:460770:460770 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
dgxh100-4:460770:460770 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.19.3+cuda12.3
dgxh100-4:460770:460842 [0] NCCL INFO NET/IB : Using [0]mlx5_10:1/RoCE [1]mlx5_11:1/RoCE [RO]; OOB enp220s0np0:11.1.128.136<0>
dgxh100-4:460770:460842 [0] NCCL INFO Using non-device net plugin version 0
dgxh100-4:460770:460842 [0] NCCL INFO Using network IB
dgxh100-3:2152176:2152176 [0] NCCL INFO cudaDriverVersion 12000
dgxh100-3:2152176:2152176 [0] NCCL INFO Bootstrap : Using enp220s0np0:11.1.128.128<0>
dgxh100-3:2152176:2152176 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
dgxh100-3:2152176:2152176 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
dgxh100-3:2152176:2152299 [0] NCCL INFO NET/IB : Using [0]mlx5_10:1/RoCE [1]mlx5_11:1/RoCE [RO]; OOB enp220s0np0:11.1.128.128<0>
dgxh100-3:2152176:2152299 [0] NCCL INFO Using network IB
dgxh100-4:460770:460842 [0] NCCL INFO comm 0x55d26f10eca0 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId df000 commId 0x7dbcc5c09ec4fa59 - Init START
dgxh100-3:2152176:2152299 [0] NCCL INFO comm 0x558b82c07680 rank 1 nranks 2 cudaDev 0 nvmlDev 6 busId d1000 commId 0x7dbcc5c09ec4fa59 - Init START
dgxh100-4:460770:460842 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
dgxh100-4:460770:460842 [0] NCCL INFO === System : maxBw 48.0 totalBw 360.0 ===
dgxh100-4:460770:460842 [0] NCCL INFO CPU/1 (1/1/2)
dgxh100-4:460770:460842 [0] NCCL INFO + PCI[48.0] - PCI/CC000 (15b3197900000000)
dgxh100-4:460770:460842 [0] NCCL INFO + PCI[48.0] - NIC/CE000
dgxh100-4:460770:460842 [0] NCCL INFO + NET[50.0] - NET/0 (742ec90003ae6d94/1/50.000000)
dgxh100-4:460770:460842 [0] NCCL INFO + PCI[48.0] - PCI/DA000 (15b3197900000000)
dgxh100-4:460770:460842 [0] NCCL INFO + PCI[48.0] - NIC/DC000
dgxh100-4:460770:460842 [0] NCCL INFO + NET[50.0] - NET/1 (6c2ec90003ae6d94/1/50.000000)
dgxh100-4:460770:460842 [0] NCCL INFO + PCI[48.0] - PCI/DD000 (15b3197900000000)
dgxh100-4:460770:460842 [0] NCCL INFO + PCI[48.0] - GPU/DF000 (0)
dgxh100-4:460770:460842 [0] NCCL INFO + NVL[360.0] - NVS/0
dgxh100-4:460770:460842 [0] NCCL INFO ==========================================
dgxh100-4:460770:460842 [0] NCCL INFO GPU/DF000 :GPU/DF000 (0/5000.000000/LOC) NVS/0 (1/360.000000/NVL) CPU/1 (3/48.000000/PHB) NET/0 (6/48.000000/PHB) NET/1 (4/48.000000/PXB)
dgxh100-4:460770:460842 [0] NCCL INFO NET/0 :GPU/DF000 (6/48.000000/PHB) CPU/1 (3/48.000000/PHB) NET/0 (0/5000.000000/LOC) NET/1 (6/48.000000/PHB)
dgxh100-4:460770:460842 [0] NCCL INFO NET/1 :GPU/DF000 (4/48.000000/PXB) CPU/1 (3/48.000000/PHB) NET/0 (6/48.000000/PHB) NET/1 (0/5000.000000/LOC)
dgxh100-4:460770:460842 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 2, bw 24.000000/24.000000, type LOC/PXB, sameChannels 1
dgxh100-4:460770:460842 [0] NCCL INFO 0 : NET/1 GPU/0 NET/1
dgxh100-4:460770:460842 [0] NCCL INFO 1 : NET/1 GPU/0 NET/1
dgxh100-4:460770:460842 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 2, bw 48.000000/24.000000, type LOC/PXB, sameChannels 1
dgxh100-4:460770:460842 [0] NCCL INFO 0 : NET/1 GPU/0 NET/1
dgxh100-4:460770:460842 [0] NCCL INFO 1 : NET/1 GPU/0 NET/1
dgxh100-3:2152176:2152299 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
dgxh100-3:2152176:2152299 [0] NCCL INFO === System : maxBw 48.0 totalBw 360.0 ===
dgxh100-3:2152176:2152299 [0] NCCL INFO CPU/1 (1/1/2)
dgxh100-3:2152176:2152299 [0] NCCL INFO + PCI[48.0] - PCI/CC000 (15b3197900000000)
dgxh100-3:2152176:2152299 [0] NCCL INFO + PCI[48.0] - NIC/CE000
dgxh100-3:2152176:2152299 [0] NCCL INFO + NET[50.0] - NET/0 (8bcd90003ae6d94/1/50.000000)
dgxh100-3:2152176:2152299 [0] NCCL INFO + PCI[48.0] - PCI/CF000 (15b3197900000000)
dgxh100-3:2152176:2152299 [0] NCCL INFO + PCI[48.0] - GPU/D1000 (1)
dgxh100-3:2152176:2152299 [0] NCCL INFO + NVL[360.0] - NVS/0
dgxh100-3:2152176:2152299 [0] NCCL INFO + PCI[48.0] - PCI/DA000 (15b3197900000000)
dgxh100-3:2152176:2152299 [0] NCCL INFO + PCI[48.0] - NIC/DC000
dgxh100-3:2152176:2152299 [0] NCCL INFO + NET[50.0] - NET/1 (bcd90003ae6d94/1/50.000000)
dgxh100-3:2152176:2152299 [0] NCCL INFO ==========================================
dgxh100-3:2152176:2152299 [0] NCCL INFO GPU/D1000 :GPU/D1000 (0/5000.000000/LOC) NVS/0 (1/360.000000/NVL) CPU/1 (3/48.000000/PHB) NET/0 (4/48.000000/PXB) NET/1 (6/48.000000/PHB)
dgxh100-3:2152176:2152299 [0] NCCL INFO NET/0 :GPU/D1000 (4/48.000000/PXB) CPU/1 (3/48.000000/PHB) NET/0 (0/5000.000000/LOC) NET/1 (6/48.000000/PHB)
dgxh100-3:2152176:2152299 [0] NCCL INFO NET/1 :GPU/D1000 (6/48.000000/PHB) CPU/1 (3/48.000000/PHB) NET/0 (6/48.000000/PHB) NET/1 (0/5000.000000/LOC)
dgxh100-3:2152176:2152299 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 2, bw 24.000000/24.000000, type LOC/PXB, sameChannels 1
dgxh100-3:2152176:2152299 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
dgxh100-3:2152176:2152299 [0] NCCL INFO 1 : NET/0 GPU/1 NET/0
dgxh100-3:2152176:2152299 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 2, bw 48.000000/24.000000, type LOC/PXB, sameChannels 1
dgxh100-3:2152176:2152299 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
dgxh100-3:2152176:2152299 [0] NCCL INFO 1 : NET/0 GPU/1 NET/0
dgxh100-3:2152176:2152299 [0] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1
dgxh100-3:2152176:2152299 [0] NCCL INFO Tree 2 : -1 -> 1 -> 0/-1/-1
dgxh100-3:2152176:2152299 [0] NCCL INFO Tree 1 : 0 -> 1 -> -1/-1/-1
dgxh100-3:2152176:2152299 [0] NCCL INFO Tree 3 : -1 -> 1 -> 0/-1/-1
dgxh100-3:2152176:2152299 [0] NCCL INFO Ring 00 : 0 -> 1 -> 0
dgxh100-3:2152176:2152299 [0] NCCL INFO Ring 01 : 0 -> 1 -> 0
dgxh100-3:2152176:2152299 [0] NCCL INFO Ring 02 : 0 -> 1 -> 0
dgxh100-3:2152176:2152299 [0] NCCL INFO Ring 03 : 0 -> 1 -> 0
dgxh100-3:2152176:2152299 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1
dgxh100-3:2152176:2152299 [0] NCCL INFO P2P Chunksize set to 131072
dgxh100-4:460770:460842 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
dgxh100-4:460770:460842 [0] NCCL INFO Tree 2 : 1 -> 0 -> -1/-1/-1
dgxh100-4:460770:460842 [0] NCCL INFO Tree 1 : -1 -> 0 -> 1/-1/-1
dgxh100-4:460770:460842 [0] NCCL INFO Tree 3 : 1 -> 0 -> -1/-1/-1
dgxh100-4:460770:460842 [0] NCCL INFO Channel 00/04 : 0 1
dgxh100-4:460770:460842 [0] NCCL INFO Channel 01/04 : 0 1
dgxh100-4:460770:460842 [0] NCCL INFO Channel 02/04 : 0 1
dgxh100-4:460770:460842 [0] NCCL INFO Channel 03/04 : 0 1
dgxh100-4:460770:460842 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1
dgxh100-4:460770:460842 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1
dgxh100-4:460770:460842 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1
dgxh100-4:460770:460842 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1
dgxh100-4:460770:460842 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
dgxh100-4:460770:460842 [0] NCCL INFO P2P Chunksize set to 131072
dgxh100-4:460770:460842 [0] NCCL INFO Channel 00/0 : 1[6] -> 0[6] [receive] via NET/IB/1/GDRDMA
dgxh100-4:460770:460842 [0] NCCL INFO Channel 01/0 : 1[6] -> 0[6] [receive] via NET/IB/1/GDRDMA
dgxh100-3:2152176:2152299 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[6] [receive] via NET/IB/0/GDRDMA
dgxh100-3:2152176:2152299 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[6] [receive] via NET/IB/0/GDRDMA
dgxh100-3:2152176:2152299 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[6] [receive] via NET/IB/0/GDRDMA
dgxh100-3:2152176:2152299 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[6] [receive] via NET/IB/0/GDRDMA
dgxh100-3:2152176:2152299 [0] NCCL INFO Channel 00/0 : 1[6] -> 0[6] [send] via NET/IB/0/GDRDMA
dgxh100-3:2152176:2152299 [0] NCCL INFO Channel 01/0 : 1[6] -> 0[6] [send] via NET/IB/0/GDRDMA
dgxh100-3:2152176:2152299 [0] NCCL INFO Channel 02/0 : 1[6] -> 0[6] [send] via NET/IB/0/GDRDMA
dgxh100-3:2152176:2152299 [0] NCCL INFO Channel 03/0 : 1[6] -> 0[6] [send] via NET/IB/0/GDRDMA
dgxh100-4:460770:460842 [0] NCCL INFO Channel 02/0 : 1[6] -> 0[6] [receive] via NET/IB/1/GDRDMA
dgxh100-4:460770:460842 [0] NCCL INFO Channel 03/0 : 1[6] -> 0[6] [receive] via NET/IB/1/GDRDMA
dgxh100-4:460770:460842 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[6] [send] via NET/IB/1/GDRDMA
dgxh100-4:460770:460842 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[6] [send] via NET/IB/1/GDRDMA
dgxh100-4:460770:460842 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[6] [send] via NET/IB/1/GDRDMA
dgxh100-4:460770:460842 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[6] [send] via NET/IB/1/GDRDMA

dgxh100-3:2152176:2152524 [0] misc/ibvwrap.cc:187 NCCL WARN Call to ibv_modify_qp failed with error Invalid argument
dgxh100-3:2152176:2152524 [0] NCCL INFO transport/net_ib.cc:579 -> 2
dgxh100-3:2152176:2152524 [0] NCCL INFO transport/net_ib.cc:786 -> 2
dgxh100-3:2152176:2152524 [0] NCCL INFO transport/net.cc:728 -> 2
dgxh100-3:2152176:2152524 [0] NCCL INFO proxy.cc:1306 -> 2

dgxh100-3:2152176:2152524 [0] proxy.cc:1485 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=2, closing connection

dgxh100-3:2152176:2152524 [0] misc/ibvwrap.cc:187 NCCL WARN Call to ibv_modify_qp failed with error Invalid argument
dgxh100-3:2152176:2152524 [0] NCCL INFO transport/net_ib.cc:579 -> 2
dgxh100-3:2152176:2152524 [0] NCCL INFO transport/net_ib.cc:786 -> 2
dgxh100-3:2152176:2152524 [0] NCCL INFO transport/net.cc:728 -> 2
dgxh100-3:2152176:2152524 [0] NCCL INFO proxy.cc:1306 -> 2
dgxh100-3:2152176:2152524 [0] NCCL INFO proxy.cc:1377 -> 2

dgxh100-3:2152176:2152524 [0] proxy.cc:1519 NCCL WARN [Proxy Service 1] Failed to execute operation Connect from rank 1, retcode 2

dgxh100-3:2152176:2152299 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer dgxh100-3<49931>
dgxh100-3:2152176:2152299 [0] NCCL INFO misc/socket.cc:749 -> 6

dgxh100-3:2152176:2152299 [0] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f5e4c27c8e0
dgxh100-3:2152176:2152299 [0] NCCL INFO transport/net.cc:288 -> 3
dgxh100-3:2152176:2152299 [0] NCCL INFO transport.cc:148 -> 3
dgxh100-3:2152176:2152299 [0] NCCL INFO init.cc:1079 -> 3
dgxh100-3:2152176:2152299 [0] NCCL INFO init.cc:1358 -> 3
dgxh100-3:2152176:2152299 [0] NCCL INFO group.cc:65 -> 3 [Async thread]
dgxh100-3:2152176:2152176 [0] NCCL INFO group.cc:406 -> 3
dgxh100-3:2152176:2152176 [0] NCCL INFO group.cc:96 -> 3
dgxh100-3: Test NCCL failure common.cu:908 'internal error - please report this issue to the NCCL developers / Socket recv failed while polling for opId=0x7f5e4c27c8e0'
 .. dgxh100-3 pid 2152176: Test failure common.cu:806

dgxh100-4:460770:460869 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer 11.1.128.128<50998>
dgxh100-4:460770:460869 [0] NCCL INFO misc/socket.cc:750 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO transport/net_ib.cc:781 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO transport/net.cc:826 -> 6

dgxh100-4:460770:460869 [0] misc/socket.cc:30 NCCL WARN socketProgressOpt: Call to recv from 11.1.128.128<52035> failed : Connection reset by peer
dgxh100-4:460770:460869 [0] NCCL INFO misc/socket.cc:47 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO misc/socket.cc:750 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO transport/net_ib.cc:710 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO transport/net.cc:677 -> 6

dgxh100-4:460770:460869 [0] misc/socket.cc:30 NCCL WARN socketProgressOpt: Call to recv from 11.1.128.128<35201> failed : Connection reset by peer
dgxh100-4:460770:460869 [0] NCCL INFO misc/socket.cc:47 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO misc/socket.cc:750 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO transport/net_ib.cc:710 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO transport/net.cc:677 -> 6
dgxh100-4:460770:460842 [0] NCCL INFO transport/net.cc:399 -> 6
dgxh100-4:460770:460842 [0] NCCL INFO transport.cc:166 -> 6

dgxh100-4:460770:460869 [0] misc/socket.cc:30 NCCL WARN socketProgressOpt: Call to recv from 11.1.128.128<43351> failed : Connection reset by peer
dgxh100-4:460770:460869 [0] NCCL INFO misc/socket.cc:47 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO misc/socket.cc:750 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO transport/net_ib.cc:710 -> 6
dgxh100-4:460770:460869 [0] NCCL INFO transport/net.cc:677 -> 6
dgxh100-4:460770:460842 [0] NCCL INFO init.cc:1117 -> 6
dgxh100-4:460770:460842 [0] NCCL INFO init.cc:1396 -> 6
dgxh100-4:460770:460842 [0] NCCL INFO group.cc:64 -> 6 [Async thread]
dgxh100-4:460770:460770 [0] NCCL INFO group.cc:418 -> 6
dgxh100-4:460770:460770 [0] NCCL INFO group.cc:95 -> 6
dgxh100-4: Test NCCL failure common.cu:908 'remote process exited or there was a network error / socketProgressOpt: Call to recv from 11.1.128.128<43351> failed : Connection reset by peer'
 .. dgxh100-4 pid 460770: Test failure common.cu:806

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[51270,1],1] Exit code: 3

sjeaugey commented 7 months ago

It could be due to GID_INDEX=3 not being valid. You should run show_gids and see which GID index to use. It's dependent on your IP configuration. You may need to fix your IP config and routing. But this is not something we can help with, given this is very dependent on the switch config and also very complex and time consuming. You should reach out to your ethernet vendor for help.
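As a rough illustration of the kind of information show_gids reports, the sketch below reads the GID table directly from sysfs. It assumes the usual Linux RDMA sysfs layout under /sys/class/infiniband (the gids, gid_attrs/types and gid_attrs/ndevs directories of each port). The names `list_gids` and `gid_to_ipv4` are made up for this example, and the device names are the ones passed to NCCL_IB_HCA earlier in this thread.

```python
from pathlib import Path

EMPTY_GID = "0000:0000:0000:0000:0000:0000:0000:0000"


def _read(path):
    """Return the stripped file content, or '' if the entry cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def gid_to_ipv4(gid):
    """Decode an IPv4-mapped GID (::ffff:a.b.c.d); return '' for other GIDs."""
    groups = gid.lower().split(":")
    if len(groups) != 8 or groups[:5] != ["0000"] * 5 or groups[5] != "ffff":
        return ""
    raw = bytes.fromhex(groups[6] + groups[7])
    return ".".join(str(b) for b in raw)


def list_gids(device, port=1, sysfs_root="/sys/class/infiniband"):
    """List the populated GID table entries of one port of an RDMA device."""
    port_dir = Path(sysfs_root) / device / "ports" / str(port)
    gids_dir = port_dir / "gids"
    if not gids_dir.is_dir():
        return []
    rows = []
    entries = [p for p in gids_dir.iterdir() if p.name.isdigit()]
    for entry in sorted(entries, key=lambda p: int(p.name)):
        gid = _read(entry)
        if not gid or gid == EMPTY_GID:
            continue  # unused slot in the GID table
        rows.append({
            "index": int(entry.name),
            "gid": gid,
            "ipv4": gid_to_ipv4(gid),
            "type": _read(port_dir / "gid_attrs" / "types" / entry.name),
            "netdev": _read(port_dir / "gid_attrs" / "ndevs" / entry.name),
        })
    return rows


if __name__ == "__main__":
    # Device names taken from NCCL_IB_HCA in this thread; adjust as needed.
    for dev in ("mlx5_10", "mlx5_11"):
        for row in list_gids(dev):
            print(dev, row["index"], row["gid"], row["ipv4"] or "-",
                  row["type"], row["netdev"])
```

An entry whose type is RoCE v2 and whose IPv4 address sits on the subnet actually routed between the two hosts is typically the candidate for NCCL_IB_GID_INDEX. Whether that is index 3 depends on the IP configuration, and the index has to be checked on both hosts.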