Closed sdonoso closed 1 month ago
Can you confirm that the same version of the NCCL library is being used on both of these hosts?
I checked, and both nodes have the same version:
python -c "import torch;print(torch.cuda.nccl.version())"
(2, 19, 4)
Unfortunately, we don't currently check at runtime that the NCCL versions match. From the log, it looks like at least one node is running a newer version of NCCL:
puente:21488:21488 [0] NCCL INFO NCCL version 2.22.3+cuda12.4
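The version printed by torch.cuda.nccl.version() is the one PyTorch itself uses, which may not be the library the failing program actually loaded, so the NCCL_DEBUG=INFO log is the more reliable source. A minimal sketch, assuming the log has been captured as text, that collects the version each host reports (the function name nccl_versions_by_host is hypothetical, not part of NCCL or PyTorch):

```python
import re
from collections import defaultdict

# Matches lines such as:
#   puente:21488:21488 [0] NCCL INFO NCCL version 2.22.3+cuda12.4
VERSION_LINE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO NCCL version (?P<version>\S+)"
)

def nccl_versions_by_host(log_text):
    """Return {hostname: set of NCCL version strings} found in the log text."""
    versions = defaultdict(set)
    for line in log_text.splitlines():
        match = VERSION_LINE.match(line.strip())
        if match:
            versions[match.group("host")].add(match.group("version"))
    return dict(versions)

# Two version lines taken from the second log in this thread.
sample = """\
mbay-csp3:11509:11509 [0] NCCL INFO NCCL version 2.22.3+cuda12.2
mbay-csp2:157749:157749 [0] NCCL INFO NCCL version 2.22.3+cuda12.2
"""

found = nccl_versions_by_host(sample)
all_versions = set().union(*found.values())
print(found)
print("mismatch" if len(all_versions) > 1 else "consistent")
```

If a host is missing from the result, its ranks may not have printed the version line, so an empty entry is not proof that the versions match.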
Hello, I'm facing the same issue. We have two nodes with 8 H100s each, and we can't figure out the error.
executing nccl with #gpu 8 begin size 16M end size 256M, nprocpernode=1.
running all_reduce
# nThread 1 nGpus 8 minBytes 16777216 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 11509 on mbay-csp3 device 0 [0x18] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 11509 on mbay-csp3 device 1 [0x29] NVIDIA H100 80GB HBM3
# Rank 2 Group 0 Pid 11509 on mbay-csp3 device 2 [0x3a] NVIDIA H100 80GB HBM3
# Rank 3 Group 0 Pid 11509 on mbay-csp3 device 3 [0x5c] NVIDIA H100 80GB HBM3
# Rank 4 Group 0 Pid 11509 on mbay-csp3 device 4 [0x9a] NVIDIA H100 80GB HBM3
# Rank 5 Group 0 Pid 11509 on mbay-csp3 device 5 [0xaa] NVIDIA H100 80GB HBM3
# Rank 6 Group 0 Pid 11509 on mbay-csp3 device 6 [0xba] NVIDIA H100 80GB HBM3
# Rank 7 Group 0 Pid 11509 on mbay-csp3 device 7 [0xca] NVIDIA H100 80GB HBM3
# Rank 8 Group 0 Pid 157749 on mbay-csp2 device 0 [0x18] NVIDIA H100 80GB HBM3
# Rank 9 Group 0 Pid 157749 on mbay-csp2 device 1 [0x29] NVIDIA H100 80GB HBM3
# Rank 10 Group 0 Pid 157749 on mbay-csp2 device 2 [0x3a] NVIDIA H100 80GB HBM3
# Rank 11 Group 0 Pid 157749 on mbay-csp2 device 3 [0x5c] NVIDIA H100 80GB HBM3
# Rank 12 Group 0 Pid 157749 on mbay-csp2 device 4 [0x9a] NVIDIA H100 80GB HBM3
# Rank 13 Group 0 Pid 157749 on mbay-csp2 device 5 [0xaa] NVIDIA H100 80GB HBM3
# Rank 14 Group 0 Pid 157749 on mbay-csp2 device 6 [0xba] NVIDIA H100 80GB HBM3
# Rank 15 Group 0 Pid 157749 on mbay-csp2 device 7 [0xca] NVIDIA H100 80GB HBM3
mbay-csp3:11509:11509 [0] NCCL INFO Bootstrap : Using enx00e04c099226:172.17.88.220<0>
mbay-csp3:11509:11509 [0] NCCL INFO cudaDriverVersion 12020
mbay-csp2:157749:157749 [0] NCCL INFO cudaDriverVersion 12020
mbay-csp3:11509:11509 [0] NCCL INFO NCCL version 2.22.3+cuda12.2
mbay-csp2:157749:157749 [0] NCCL INFO Bootstrap : Using enx765d22e0329f:169.254.95.120<0>
mbay-csp2:157749:157749 [0] NCCL INFO NCCL version 2.22.3+cuda12.2
mbay-csp3:11509:11566 [6] NCCL INFO Plugin Path : /home/administrator/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/nccl_rdma_sharp_plugin/lib/libnccl-net.so
mbay-csp3:11509:11566 [6] NCCL INFO P2P plugin IBext_v8
mbay-csp3:11509:11566 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_2:1/IB/SHARP [3]mlx5_3:1/IB/SHARP [4]mlx5_4:1/IB/SHARP [5]mlx5_6:1/IB/SHARP [6]mlx5_7:1/IB/SHARP [RO]; OOB enx00e04c099226:172.17.88.220<0>
mbay-csp3:11509:11566 [6] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11565 [5] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11560 [0] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11567 [7] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11562 [2] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11563 [3] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11564 [4] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11561 [1] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157804 [5] NCCL INFO Plugin Path : /home/administrator/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/nccl_rdma_sharp_plugin/lib/libnccl-net.so
mbay-csp2:157749:157804 [5] NCCL INFO P2P plugin IBext_v8
mbay-csp2:157749:157804 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_2:1/IB/SHARP [3]mlx5_4:1/IB/SHARP [4]mlx5_5:1/IB/SHARP [5]mlx5_6:1/IB/SHARP [6]mlx5_7:1/IB/SHARP [7]mlx5_8:1/IB/SHARP [8]mlx5_3:1/RoCE [RO]; OOB enx765d22e0329f:169.254.95.120<0>
mbay-csp2:157749:157806 [7] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157802 [3] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157803 [4] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157800 [1] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157805 [6] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157801 [2] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157799 [0] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157804 [5] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11567 [7] misc/socket.cc:533 NCCL WARN socketPollConnect: Connect to 169.254.95.120<34065> returned 101(Network is unreachable) errno 115(Operation now in progress)
mbay-csp3:11509:11567 [7] NCCL INFO misc/socket.cc:570 -> 2
mbay-csp3:11509:11567 [7] NCCL INFO misc/socket.cc:621 -> 2
mbay-csp3:11509:11567 [7] NCCL INFO bootstrap.cc:298 -> 2
mbay-csp3:11509:11567 [7] NCCL INFO init.cc:1393 -> 2
mbay-csp3:11509:11567 [7] NCCL INFO group.cc:70 -> 2 [Async thread]
mbay-csp3:11509:11548 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 169.254.95.120<53571> failed : Network is unreachable
mbay-csp3:11509:11548 [0] NCCL INFO misc/socket.cc:567 -> 2
mbay-csp3:11509:11548 [0] NCCL INFO misc/socket.cc:621 -> 2
mbay-csp3:11509:11548 [0] NCCL INFO bootstrap.cc:163 -> 2
mbay-csp3:11509:11563 [3] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO bootstrap.cc:85 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11560 [0] NCCL INFO bootstrap.cc:301 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO bootstrap.cc:85 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11560 [0] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO bootstrap.cc:90 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11566 [6] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO bootstrap.cc:90 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11561 [1] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO bootstrap.cc:85 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11564 [4] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11562 [2] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO bootstrap.cc:90 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11560 [0] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11563 [3] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11509 [7] NCCL INFO group.cc:420 -> 2
mbay-csp3:11509:11509 [7] NCCL INFO group.cc:546 -> 2
mbay-csp3:11509:11509 [7] NCCL INFO group.cc:101 -> 2
mbay-csp3: Test NCCL failure common.cu:997 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
.. mbay-csp3 pid 11509: Test failure common.cu:876
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[10467,1],0]
Exit code: 3
--------------------------------------------------------------------------
The two nodes are using the same NCCL library: 1st Node: 2nd Node:
Both nodes also have the same hpcx folder:
We also use a script similar to sdonoso's. Any pointers?
@tonyw1213 Your problem is probably completely different: it looks like a network connectivity issue. NCCL reports using the enx00e04c099226:172.17.88.220 interface on mbay-csp3, whereas on mbay-csp2 it's using enx765d22e0329f:169.254.95.120. These two networks can't talk to each other (you can verify using ping), which is why you see the "Network is unreachable" messages when NCCL processes try to communicate. I'm guessing you'll need to use the NCCL_SOCKET_IFNAME variable (see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-ifname) to specify which network interface(s) are to be used.
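The bootstrap mismatch described above can be spotted mechanically. The following is a rough sketch, assuming the NCCL_DEBUG=INFO output is available as text; it extracts the interface and address from each "Bootstrap : Using" line and flags link-local addresses (169.254.0.0/16), which are not routed beyond the local link. The function name bootstrap_interfaces is made up for this illustration.

```python
import ipaddress
import re

# Matches lines such as:
#   mbay-csp3:11509:11509 [0] NCCL INFO Bootstrap : Using enx00e04c099226:172.17.88.220<0>
BOOTSTRAP_LINE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO Bootstrap : Using "
    r"(?P<iface>[^:\s]+):(?P<ip>[0-9.]+)<\d+>"
)

def bootstrap_interfaces(log_text):
    """Return {hostname: (interface, IPv4 address, is_link_local)}."""
    result = {}
    for line in log_text.splitlines():
        match = BOOTSTRAP_LINE.match(line.strip())
        if match:
            addr = ipaddress.ip_address(match.group("ip"))
            result[match.group("host")] = (match.group("iface"), addr, addr.is_link_local)
    return result

# The two bootstrap lines from the log in this thread.
sample = """\
mbay-csp3:11509:11509 [0] NCCL INFO Bootstrap : Using enx00e04c099226:172.17.88.220<0>
mbay-csp2:157749:157749 [0] NCCL INFO Bootstrap : Using enx765d22e0329f:169.254.95.120<0>
"""

info = bootstrap_interfaces(sample)
for host, (iface, addr, link_local) in info.items():
    note = "link-local, probably not reachable from other subnets" if link_local else "ok"
    print(f"{host}: {iface} {addr} ({note})")
```

A link-local address is only a hint. Two routable addresses on different subnets can fail in the same way, so ping between the exact addresses NCCL reports is still the definitive check.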
Hello @kiskra-nvidia, thank you for the fast response. The two nodes can ping each other, and I have made sure they can talk to each other. I'm also able to SSH between the nodes.
@tonyw1213 check the name of the interface on mbay-csp2 that has IP address 172.17.10.102, and set NCCL_SOCKET_IFNAME=<ip interface>. Otherwise NCCL will pick the first interface it finds, and in your case it's enx765d22e0329f, which has IP 169.254.95.120.
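One way to look up the interface name is to search the output of `ip -o -4 addr show` for the address. A small sketch, assuming the usual one-line-per-address layout of that command; the helper name interface_for_ip and the interface name eno1 in the sample are hypothetical, and the sample output is trimmed and invented for illustration:

```python
def interface_for_ip(ip_addr_output, wanted_ip):
    """Return the interface name that carries wanted_ip, or None if absent.

    Expects text in the layout printed by `ip -o -4 addr show`:
        <index>: <ifname>    inet <addr>/<prefix> ...
    """
    for line in ip_addr_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "inet":
            if fields[3].split("/")[0] == wanted_ip:
                return fields[1]
    return None

# Hypothetical, trimmed output for mbay-csp2; "eno1" is an invented name.
sample_output = """\
1: lo    inet 127.0.0.1/8 scope host lo
2: eno1    inet 172.17.10.102/16 brd 172.17.255.255 scope global eno1
3: enx765d22e0329f    inet 169.254.95.120/16 brd 169.254.255.255 scope link enx765d22e0329f
"""

ifname = interface_for_ip(sample_output, "172.17.10.102")
if ifname is not None:
    # Value to export before launching the NCCL job.
    print(f"NCCL_SOCKET_IFNAME={ifname}")
```

The variable has to be visible to the NCCL processes on every node, so it needs to be set in the environment the launcher passes to the ranks, not only in the local shell.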
@sjeaugey Thank you for the help. Our issue was that csp2 was using the wrong network interface. Instead of setting NCCL_SOCKET_IFNAME, my team decided to remove enx765 as a candidate interface, which allowed NCCL to find the right one.
Also, we are big fans of your work.
Thanks. Closing.
Hi, I get the following error when I run the test. There are two nodes with 8 A100s and the same config.
mpirun version
The Environment