NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Test NCCL failure common.cu:997 'internal error #231

Closed sdonoso closed 1 month ago

sdonoso commented 1 month ago

Hi, I get the following error when I run the test. There are two nodes with 8 A100s and the same configuration.

rene@puente:~/nccl-tests$ mpirun -np 2 -f hostfile -env UCX_NET_DEVICES=mlx5_0:1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  21488 on    puente device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 966509 on     kalila device  0 [0x07] NVIDIA A100-SXM4-80GB
puente:21488:21488 [0] NCCL INFO Bootstrap : Using enp45s0f0:149.153.153.83<0>
puente:21488:21488 [0] NCCL INFO cudaDriverVersion 12020
puente:21488:21488 [0] NCCL INFO NCCL version 2.22.3+cuda12.4
kalila:966509:966509 [0] NCCL INFO cudaDriverVersion 12020
kalila:966509:966509 [0] NCCL INFO Bootstrap : Using enp45s0f0:149.153.153.84<0>
kalila:966509:966509 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
puente:21488:21838 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
puente:21488:21838 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/IB [6]mlx5_6:1/IB [7]mlx5_7:1/IB [RO]; OOB enp45s0f0:149.153.153.83<0>
puente:21488:21838 [0] NCCL INFO Using network IB
kalila:966509:966515 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/IB [6]mlx5_6:1/IB [7]mlx5_7:1/IB [RO]; OOB enp45s0f0:149.153.153.84<0>
kalila:966509:966515 [0] NCCL INFO Using non-device net plugin version 0
kalila:966509:966515 [0] NCCL INFO Using network IB
kalila:966509:966515 [0] NCCL INFO comm 0x56446c7897a0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 commId 0xe846e9d398e293d7 - Init START

puente:21488:21838 [0] bootstrap.cc:87 NCCL WARN Message truncated : received 64 bytes instead of 8
puente:21488:21838 [0] NCCL INFO bootstrap.cc:543 -> 3
puente:21488:21838 [0] NCCL INFO bootstrap.cc:554 -> 3
puente:21488:21838 [0] NCCL INFO bootstrap.cc:323 -> 3
puente:21488:21838 [0] NCCL INFO init.cc:1393 -> 3
puente:21488:21838 [0] NCCL INFO group.cc:70 -> 3 [Async thread]
puente:21488:21488 [0] NCCL INFO group.cc:420 -> 3
puente:21488:21488 [0] NCCL INFO group.cc:546 -> 3
puente:21488:21488 [0] NCCL INFO group.cc:101 -> 3
puente: Test NCCL failure common.cu:997 'internal error - please report this issue to the NCCL developers / '
 .. puente pid 21488: Test failure common.cu:876

kalila:966509:966515 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer puente<34218>
kalila:966509:966515 [0] NCCL INFO misc/socket.cc:58 -> 6
kalila:966509:966515 [0] NCCL INFO misc/socket.cc:787 -> 6
kalila:966509:966515 [0] NCCL INFO bootstrap.cc:80 -> 6
kalila:966509:966515 [0] NCCL INFO bootstrap.cc:399 -> 6
kalila:966509:966515 [0] NCCL INFO init.cc:814 -> 6
kalila:966509:966515 [0] NCCL INFO init.cc:1390 -> 6
kalila:966509:966515 [0] NCCL INFO group.cc:64 -> 6 [Async thread]
kalila:966509:966509 [0] NCCL INFO group.cc:418 -> 6
kalila:966509:966509 [0] NCCL INFO group.cc:95 -> 6
kalila: Test NCCL failure common.cu:997 'remote process exited or there was a network error / '
 .. kalila pid 966509: Test failure common.cu:876

mpirun version

rene@puente:~/nccl-tests$ mpirun --version
HYDRA build details:
    Version:                                 4.2.1
    Release Date:                            Wed Apr 17 15:30:02 CDT 2024
    CC:                              gcc      
    Configure options:                       '--disable-option-checking' '--prefix=/usr/local/mpich' '--with-device=ch4:ucx' '--cache-file=/dev/null' '--srcdir=../../../../src/pm/hydra' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -DNETMOD_INLINE=__netmod_inline_ucx__ -I/home/rene/mpich-4.2.1/build/src/mpl/include -I/home/rene/mpich-4.2.1/src/mpl/include -I/home/rene/mpich-4.2.1/modules/json-c -I/home/rene/mpich-4.2.1/build/modules/json-c -D_REENTRANT -I/home/rene/mpich-4.2.1/build/src/mpi/romio/include -I/home/rene/mpich-4.2.1/src/pmi/include -I/home/rene/mpich-4.2.1/build/src/pmi/include -I/home/rene/mpich-4.2.1/build/modules/yaksa/src/frontend/include -I/home/rene/mpich-4.2.1/modules/yaksa/src/frontend/include -I/home/rene/mpich-4.2.1/build/modules/ucx/src -I/home/rene/mpich-4.2.1/modules/ucx/src'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Demux engines available:                 poll select

The Environment

OS: Ubuntu 22.04

MLNX_OFED_LINUX-24.04-0.6.6.0:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

AddyLaddy commented 1 month ago

Can you confirm that the same version of the NCCL library is being used on both of these hosts?

sdonoso commented 1 month ago

I checked, and both nodes have the same version:

 python -c "import torch;print(torch.cuda.nccl.version())"
(2, 19, 4)

AddyLaddy commented 1 month ago

Unfortunately, NCCL doesn't currently check at runtime that the versions match. And from the log, it looks like at least one node is running a newer version of NCCL:

puente:21488:21488 [0] NCCL INFO NCCL version 2.22.3+cuda12.4
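
A mismatch like this can be looked for directly in the NCCL_DEBUG=INFO output. Note that the version PyTorch reports is not necessarily the version of the library that the nccl-tests binary loads, which may explain why (2, 19, 4) and 2.22.3 both show up here. The following is a minimal sketch, not part of nccl-tests: the helper names and the regular expression are assumptions based only on the log format shown in this thread.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the logs in this thread:
#   puente:21488:21488 [0] NCCL INFO NCCL version 2.22.3+cuda12.4
VERSION_RE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO NCCL version (?P<version>\S+)"
)


def nccl_versions_by_host(log_text):
    """Return {host: set of NCCL version strings} found in the log text."""
    versions = defaultdict(set)
    for line in log_text.splitlines():
        match = VERSION_RE.match(line.strip())
        if match:
            versions[match.group("host")].add(match.group("version"))
    return dict(versions)


def versions_disagree(versions):
    """True if more than one distinct version string appears across all hosts."""
    distinct = set()
    for host_versions in versions.values():
        distinct |= host_versions
    return len(distinct) > 1
```

In the first log of this thread only puente prints a version line, so kalila's version cannot be confirmed from the log alone; a host that is missing from the returned dictionary needs to be checked another way.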
tonyw1213 commented 1 month ago

Hello, I'm facing the same issue. Two nodes with 8 H100s and we can't seem to figure out the error.

executing nccl with #gpu 8 begin size 16M end size 256M, nprocpernode=1.
running all_reduce
# nThread 1 nGpus 8 minBytes 16777216 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  11509 on  mbay-csp3 device  0 [0x18] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid  11509 on  mbay-csp3 device  1 [0x29] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid  11509 on  mbay-csp3 device  2 [0x3a] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid  11509 on  mbay-csp3 device  3 [0x5c] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid  11509 on  mbay-csp3 device  4 [0x9a] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid  11509 on  mbay-csp3 device  5 [0xaa] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid  11509 on  mbay-csp3 device  6 [0xba] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid  11509 on  mbay-csp3 device  7 [0xca] NVIDIA H100 80GB HBM3
#  Rank  8 Group  0 Pid 157749 on  mbay-csp2 device  0 [0x18] NVIDIA H100 80GB HBM3
#  Rank  9 Group  0 Pid 157749 on  mbay-csp2 device  1 [0x29] NVIDIA H100 80GB HBM3
#  Rank 10 Group  0 Pid 157749 on  mbay-csp2 device  2 [0x3a] NVIDIA H100 80GB HBM3
#  Rank 11 Group  0 Pid 157749 on  mbay-csp2 device  3 [0x5c] NVIDIA H100 80GB HBM3
#  Rank 12 Group  0 Pid 157749 on  mbay-csp2 device  4 [0x9a] NVIDIA H100 80GB HBM3
#  Rank 13 Group  0 Pid 157749 on  mbay-csp2 device  5 [0xaa] NVIDIA H100 80GB HBM3
#  Rank 14 Group  0 Pid 157749 on  mbay-csp2 device  6 [0xba] NVIDIA H100 80GB HBM3
#  Rank 15 Group  0 Pid 157749 on  mbay-csp2 device  7 [0xca] NVIDIA H100 80GB HBM3
mbay-csp3:11509:11509 [0] NCCL INFO Bootstrap : Using enx00e04c099226:172.17.88.220<0>
mbay-csp3:11509:11509 [0] NCCL INFO cudaDriverVersion 12020
mbay-csp2:157749:157749 [0] NCCL INFO cudaDriverVersion 12020
mbay-csp3:11509:11509 [0] NCCL INFO NCCL version 2.22.3+cuda12.2
mbay-csp2:157749:157749 [0] NCCL INFO Bootstrap : Using enx765d22e0329f:169.254.95.120<0>
mbay-csp2:157749:157749 [0] NCCL INFO NCCL version 2.22.3+cuda12.2
mbay-csp3:11509:11566 [6] NCCL INFO Plugin Path : /home/administrator/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/nccl_rdma_sharp_plugin/lib/libnccl-net.so
mbay-csp3:11509:11566 [6] NCCL INFO P2P plugin IBext_v8
mbay-csp3:11509:11566 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_2:1/IB/SHARP [3]mlx5_3:1/IB/SHARP [4]mlx5_4:1/IB/SHARP [5]mlx5_6:1/IB/SHARP [6]mlx5_7:1/IB/SHARP [RO]; OOB enx00e04c099226:172.17.88.220<0>
mbay-csp3:11509:11566 [6] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11565 [5] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11560 [0] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11567 [7] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11562 [2] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11563 [3] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11564 [4] NCCL INFO Using network IBext_v8
mbay-csp3:11509:11561 [1] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157804 [5] NCCL INFO Plugin Path : /home/administrator/hpcx-v2.19-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/nccl_rdma_sharp_plugin/lib/libnccl-net.so
mbay-csp2:157749:157804 [5] NCCL INFO P2P plugin IBext_v8
mbay-csp2:157749:157804 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_2:1/IB/SHARP [3]mlx5_4:1/IB/SHARP [4]mlx5_5:1/IB/SHARP [5]mlx5_6:1/IB/SHARP [6]mlx5_7:1/IB/SHARP [7]mlx5_8:1/IB/SHARP [8]mlx5_3:1/RoCE [RO]; OOB enx765d22e0329f:169.254.95.120<0>
mbay-csp2:157749:157806 [7] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157802 [3] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157803 [4] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157800 [1] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157805 [6] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157801 [2] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157799 [0] NCCL INFO Using network IBext_v8
mbay-csp2:157749:157804 [5] NCCL INFO Using network IBext_v8

mbay-csp3:11509:11567 [7] misc/socket.cc:533 NCCL WARN socketPollConnect: Connect to 169.254.95.120<34065> returned 101(Network is unreachable) errno 115(Operation now in progress)
mbay-csp3:11509:11567 [7] NCCL INFO misc/socket.cc:570 -> 2
mbay-csp3:11509:11567 [7] NCCL INFO misc/socket.cc:621 -> 2
mbay-csp3:11509:11567 [7] NCCL INFO bootstrap.cc:298 -> 2
mbay-csp3:11509:11567 [7] NCCL INFO init.cc:1393 -> 2
mbay-csp3:11509:11567 [7] NCCL INFO group.cc:70 -> 2 [Async thread]

mbay-csp3:11509:11548 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 169.254.95.120<53571> failed : Network is unreachable
mbay-csp3:11509:11548 [0] NCCL INFO misc/socket.cc:567 -> 2
mbay-csp3:11509:11548 [0] NCCL INFO misc/socket.cc:621 -> 2
mbay-csp3:11509:11548 [0] NCCL INFO bootstrap.cc:163 -> 2
mbay-csp3:11509:11563 [3] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO bootstrap.cc:85 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11560 [0] NCCL INFO bootstrap.cc:301 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO bootstrap.cc:85 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11560 [0] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO bootstrap.cc:90 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11565 [5] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11566 [6] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO bootstrap.cc:90 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11566 [6] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11561 [1] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO bootstrap.cc:85 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11561 [1] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11564 [4] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11564 [4] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11562 [2] NCCL INFO misc/socket.cc:47 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO misc/socket.cc:805 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO bootstrap.cc:90 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11562 [2] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11560 [0] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11563 [3] NCCL INFO bootstrap.cc:543 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO bootstrap.cc:554 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO bootstrap.cc:306 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO init.cc:1393 -> 3
mbay-csp3:11509:11563 [3] NCCL INFO group.cc:70 -> 3 [Async thread]
mbay-csp3:11509:11509 [7] NCCL INFO group.cc:420 -> 2
mbay-csp3:11509:11509 [7] NCCL INFO group.cc:546 -> 2
mbay-csp3:11509:11509 [7] NCCL INFO group.cc:101 -> 2
mbay-csp3: Test NCCL failure common.cu:997 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
 .. mbay-csp3 pid 11509: Test failure common.cu:876
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[10467,1],0]
  Exit code:    3
--------------------------------------------------------------------------

The two nodes are using the same NCCL library: 1st node: [image], 2nd node: [image]

Both nodes also have the same hpcx folder: [image]

We also use a script similar to sdonoso's. Any pointers?

kiskra-nvidia commented 1 month ago

@tonyw1213 Your problem is probably completely different: it looks like a network connectivity issue. NCCL reports using the enx00e04c099226:172.17.88.220 interface on mbay-csp3, whereas on mbay-csp2 it's using enx765d22e0329f:169.254.95.120. These two networks can't talk to each other (you can verify using ping), which is why you see the Network is unreachable messages when NCCL processes try to communicate. I'm guessing you'll need to use the NCCL_SOCKET_IFNAME (see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-ifname) variable to specify which network interface(s) are to be used.
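
The interface comparison described above can be sketched in code. This is a rough illustration, assuming the `Bootstrap : Using <ifname>:<ip><port>` line format seen in this thread and IPv4 addresses only; the function names are hypothetical. It extracts the bootstrap interface each host picked and flags addresses in the IPv4 link-local range (169.254.0.0/16), which are typically self-assigned and not routable between subnets.

```python
import ipaddress
import re

# Assumed line shape, taken from the logs in this thread:
#   mbay-csp3:11509:11509 [0] NCCL INFO Bootstrap : Using enx00e04c099226:172.17.88.220<0>
BOOTSTRAP_RE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO Bootstrap : Using "
    r"(?P<ifname>[^:\s]+):(?P<ip>\d+\.\d+\.\d+\.\d+)<\d+>"
)


def bootstrap_interfaces(log_text):
    """Return {host: (interface name, IPv4Address)} from NCCL bootstrap lines."""
    found = {}
    for line in log_text.splitlines():
        match = BOOTSTRAP_RE.match(line.strip())
        if match:
            found[match.group("host")] = (
                match.group("ifname"),
                ipaddress.ip_address(match.group("ip")),
            )
    return found


def link_local_hosts(interfaces):
    """Return the sorted hosts whose bootstrap address is link-local."""
    return sorted(host for host, (_, ip) in interfaces.items() if ip.is_link_local)
```

Applied to the log above, this reports mbay-csp2 as the host that bootstrapped on a link-local address, which matches the diagnosis in this comment.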

tonyw1213 commented 1 month ago

Hello @kiskra-nvidia, thank you for the fast response. The nodes can ping each other, and I have made sure they can talk to each other. I'm also able to SSH between the nodes. [image]

sjeaugey commented 1 month ago

@tonyw1213 check what is the name of the interface on mbay-csp2 which has IP address 172.17.10.102, and set NCCL_SOCKET_IFNAME=<ip interface>. Otherwise NCCL will pick the first interface it finds, and in your case it's enx765d22e0329f which has IP 169.254.95.120.
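
One way to look up the interface name for a given IP address is to parse the one-line output of `ip -o -4 addr show`. The sketch below assumes a Linux host with iproute2 installed; the function names are hypothetical, and the sample text in the comment reuses the enp45s0f0 interface from the first log with an assumed /24 prefix.

```python
import ipaddress
import subprocess


def read_ipv4_addresses():
    """Return the raw output of `ip -o -4 addr show` (Linux, iproute2)."""
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def interface_for_ip(ip_addr_output, wanted_ip):
    """Return the interface name that owns wanted_ip, or None if not found."""
    wanted = ipaddress.ip_address(wanted_ip)
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Expected shape per line: "<index>: <ifname> inet <address>/<prefix> ..."
        # e.g. "2: enp45s0f0    inet 149.153.153.83/24 brd ... scope global enp45s0f0"
        if len(fields) >= 4 and fields[2] == "inet":
            if ipaddress.ip_interface(fields[3]).ip == wanted:
                return fields[1]
    return None
```

On mbay-csp2, calling `interface_for_ip(read_ipv4_addresses(), "172.17.10.102")` would give the name to put in NCCL_SOCKET_IFNAME, provided that address is actually configured on the node.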

tonyw1213 commented 1 month ago

@sjeaugey Thank you for the help. Our issue was that csp2 was trying to connect through the wrong network interface. Instead of setting NCCL_SOCKET_IFNAME, my team decided to remove enx765 as a candidate interface. This allowed NCCL to find the right one.
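
For readers who cannot remove the interface, the NCCL documentation describes a `^` prefix for NCCL_SOCKET_IFNAME that excludes interfaces whose names start with the given prefixes. The sketch below only builds the environment for a launcher; the function name is hypothetical, and how the variable reaches remote ranks depends on the MPI launcher (the first command in this thread uses Hydra's `-env NAME=VALUE` form).

```python
import os


def env_excluding_interfaces(prefixes, base_env=None):
    """Return a copy of the environment with NCCL_SOCKET_IFNAME set to
    exclude every interface whose name starts with one of the prefixes."""
    env = dict(os.environ if base_env is None else base_env)
    # A leading "^" tells NCCL to skip interfaces matching the listed prefixes.
    env["NCCL_SOCKET_IFNAME"] = "^" + ",".join(prefixes)
    return env
```

For example, `env_excluding_interfaces(["enx765"])` sets the variable to `^enx765`, which would leave NCCL free to pick among the remaining interfaces.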

Also, we are big fans of your work.

sjeaugey commented 1 month ago

Thanks. Closing.