NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Test NCCL failure common.cu:961 'internal error - please report this issue to the NCCL developers / ' #204

Open a-c-dream opened 3 months ago

a-c-dream commented 3 months ago

The following error occurred when I used Open MPI to run nccl-tests across two nodes. [image] Here are some log messages:

# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 0 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  63755 on DisAI-4090-3 device  0 [0x1b] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid  63755 on DisAI-4090-3 device  1 [0x3e] NVIDIA GeForce RTX 4090
#  Rank  2 Group  0 Pid  63755 on DisAI-4090-3 device  2 [0x89] NVIDIA GeForce RTX 4090
#  Rank  3 Group  0 Pid  63755 on DisAI-4090-3 device  3 [0xb2] NVIDIA GeForce RTX 4090
#  Rank  4 Group  0 Pid 914454 on DisAI-4090-2 device  0 [0x1b] NVIDIA GeForce RTX 4090
#  Rank  5 Group  0 Pid 914454 on DisAI-4090-2 device  1 [0x3e] NVIDIA GeForce RTX 4090
#  Rank  6 Group  0 Pid 914454 on DisAI-4090-2 device  2 [0x89] NVIDIA GeForce RTX 4090
#  Rank  7 Group  0 Pid 914454 on DisAI-4090-2 device  3 [0xb2] NVIDIA GeForce RTX 4090
DisAI-4090-3:63755:63755 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
DisAI-4090-3:63755:63755 [0] NCCL INFO Bootstrap : Using eno1:10.102.0.233<0>
DisAI-4090-3:63755:63755 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
DisAI-4090-3:63755:63755 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.19.3+cuda12.3
DisAI-4090-2:914454:914454 [0] NCCL INFO cudaDriverVersion 12040
DisAI-4090-2:914454:914454 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
DisAI-4090-2:914454:914454 [0] NCCL INFO Bootstrap : Using eno1:10.102.0.232<0>
DisAI-4090-2:914454:914454 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
DisAI-4090-3:63755:63769 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
DisAI-4090-3:63755:63769 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_3:1/RoCE [2]irdma0:1/RoCE [3]irdma1:1/RoCE [RO]; OOB eno1:10.102.0.233<0>
DisAI-4090-3:63755:63771 [3] NCCL INFO Using non-device net plugin version 0
DisAI-4090-3:63755:63771 [3] NCCL INFO Using network IB
DisAI-4090-3:63755:63769 [1] NCCL INFO Using non-device net plugin version 0
DisAI-4090-3:63755:63769 [1] NCCL INFO Using network IB
DisAI-4090-3:63755:63770 [2] NCCL INFO Using non-device net plugin version 0
DisAI-4090-3:63755:63770 [2] NCCL INFO Using network IB
DisAI-4090-3:63755:63768 [0] NCCL INFO Using non-device net plugin version 0
DisAI-4090-3:63755:63768 [0] NCCL INFO Using network IB
DisAI-4090-2:914454:914481 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
DisAI-4090-2:914454:914481 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_3:1/RoCE [2]irdma1:1/RoCE [RO]; OOB eno1:10.102.0.232<0>
DisAI-4090-2:914454:914481 [3] NCCL INFO Using non-device net plugin version 0
DisAI-4090-2:914454:914481 [3] NCCL INFO Using network IB
DisAI-4090-2:914454:914479 [1] NCCL INFO Using non-device net plugin version 0
DisAI-4090-2:914454:914479 [1] NCCL INFO Using network IB
DisAI-4090-2:914454:914478 [0] NCCL INFO Using non-device net plugin version 0
DisAI-4090-2:914454:914478 [0] NCCL INFO Using network IB
DisAI-4090-2:914454:914480 [2] NCCL INFO Using non-device net plugin version 0
DisAI-4090-2:914454:914480 [2] NCCL INFO Using network IB
DisAI-4090-3:63755:63768 [0] NCCL INFO comm 0x5cdda158c8c0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 1b000 commId 0x268de34c5d8ebe44 - Init START
DisAI-4090-3:63755:63769 [1] NCCL INFO comm 0x5cdda1595770 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 3e000 commId 0x268de34c5d8ebe44 - Init START
DisAI-4090-3:63755:63770 [2] NCCL INFO comm 0x5cdda159e5f0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 89000 commId 0x268de34c5d8ebe44 - Init START
DisAI-4090-3:63755:63771 [3] NCCL INFO comm 0x5cdda15a7470 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId b2000 commId 0x268de34c5d8ebe44 - Init START

DisAI-4090-2:914454:914478 [0] bootstrap.cc:77 NCCL WARN Message truncated : received 64 bytes instead of 8
DisAI-4090-2:914454:914478 [0] NCCL INFO bootstrap.cc:412 -> 3
DisAI-4090-2:914454:914478 [0] NCCL INFO bootstrap.cc:312 -> 3
DisAI-4090-2:914454:914478 [0] NCCL INFO init.cc:1493 -> 3
DisAI-4090-2:914454:914478 [0] NCCL INFO group.cc:64 -> 3 [Async thread]
DisAI-4090-2:914454:914480 [2] NCCL INFO misc/socket.cc:47 -> 3
DisAI-4090-2:914454:914480 [2] NCCL INFO misc/socket.cc:58 -> 3
DisAI-4090-2:914454:914480 [2] NCCL INFO misc/socket.cc:789 -> 3
DisAI-4090-2:914454:914480 [2] NCCL INFO bootstrap.cc:75 -> 3
DisAI-4090-2:914454:914480 [2] NCCL INFO bootstrap.cc:412 -> 3
DisAI-4090-2:914454:914480 [2] NCCL INFO bootstrap.cc:312 -> 3
DisAI-4090-2:914454:914480 [2] NCCL INFO init.cc:1493 -> 3
DisAI-4090-2:914454:914480 [2] NCCL INFO group.cc:64 -> 3 [Async thread]
DisAI-4090-2:914454:914481 [3] NCCL INFO misc/socket.cc:47 -> 3
DisAI-4090-2:914454:914481 [3] NCCL INFO misc/socket.cc:58 -> 3
DisAI-4090-2:914454:914481 [3] NCCL INFO misc/socket.cc:789 -> 3
DisAI-4090-2:914454:914481 [3] NCCL INFO bootstrap.cc:75 -> 3
DisAI-4090-2:914454:914481 [3] NCCL INFO bootstrap.cc:412 -> 3
DisAI-4090-2:914454:914481 [3] NCCL INFO bootstrap.cc:312 -> 3
DisAI-4090-2:914454:914481 [3] NCCL INFO init.cc:1493 -> 3
DisAI-4090-2:914454:914481 [3] NCCL INFO group.cc:64 -> 3 [Async thread]
DisAI-4090-2:914454:914479 [1] NCCL INFO misc/socket.cc:47 -> 3
DisAI-4090-2:914454:914479 [1] NCCL INFO misc/socket.cc:58 -> 3
DisAI-4090-2:914454:914479 [1] NCCL INFO misc/socket.cc:789 -> 3
DisAI-4090-2:914454:914479 [1] NCCL INFO bootstrap.cc:75 -> 3
DisAI-4090-2:914454:914479 [1] NCCL INFO bootstrap.cc:412 -> 3
DisAI-4090-2:914454:914479 [1] NCCL INFO bootstrap.cc:312 -> 3
DisAI-4090-2:914454:914479 [1] NCCL INFO init.cc:1493 -> 3
DisAI-4090-2:914454:914479 [1] NCCL INFO group.cc:64 -> 3 [Async thread]
DisAI-4090-2:914454:914454 [3] NCCL INFO group.cc:418 -> 3
DisAI-4090-2:914454:914454 [3] NCCL INFO group.cc:95 -> 3
DisAI-4090-2: Test NCCL failure common.cu:961 'internal error - please report this issue to the NCCL developers / '
 .. DisAI-4090-2 pid 914454: Test failure common.cu:844
DisAI-4090-3:63755:63771 [3] NCCL INFO NVLS multicast support is not available on dev 3

DisAI-4090-3:63755:63768 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer 10.102.0.232<40318>
DisAI-4090-3:63755:63768 [0] NCCL INFO misc/socket.cc:58 -> 6
DisAI-4090-3:63755:63768 [0] NCCL INFO misc/socket.cc:787 -> 6
DisAI-4090-3:63755:63768 [0] NCCL INFO bootstrap.cc:75 -> 6
DisAI-4090-3:63755:63768 [0] NCCL INFO bootstrap.cc:399 -> 6
DisAI-4090-3:63755:63768 [0] NCCL INFO init.cc:820 -> 6
DisAI-4090-3:63755:63768 [0] NCCL INFO init.cc:1396 -> 6
DisAI-4090-3:63755:63768 [0] NCCL INFO group.cc:64 -> 6 [Async thread]
DisAI-4090-3:63755:63769 [1] NCCL INFO misc/socket.cc:47 -> 3
DisAI-4090-3:63755:63769 [1] NCCL INFO misc/socket.cc:58 -> 3
DisAI-4090-3:63755:63769 [1] NCCL INFO misc/socket.cc:787 -> 3
DisAI-4090-3:63755:63769 [1] NCCL INFO bootstrap.cc:75 -> 3
DisAI-4090-3:63755:63769 [1] NCCL INFO bootstrap.cc:399 -> 3
DisAI-4090-3:63755:63769 [1] NCCL INFO init.cc:820 -> 3
DisAI-4090-3:63755:63769 [1] NCCL INFO init.cc:1396 -> 3
DisAI-4090-3:63755:63769 [1] NCCL INFO group.cc:64 -> 3 [Async thread]
DisAI-4090-3:63755:63770 [2] NCCL INFO misc/socket.cc:47 -> 3
DisAI-4090-3:63755:63770 [2] NCCL INFO misc/socket.cc:58 -> 3
DisAI-4090-3:63755:63770 [2] NCCL INFO misc/socket.cc:787 -> 3
DisAI-4090-3:63755:63770 [2] NCCL INFO bootstrap.cc:75 -> 3
DisAI-4090-3:63755:63770 [2] NCCL INFO bootstrap.cc:399 -> 3
DisAI-4090-3:63755:63770 [2] NCCL INFO init.cc:820 -> 3
DisAI-4090-3:63755:63770 [2] NCCL INFO init.cc:1396 -> 3
DisAI-4090-3:63755:63770 [2] NCCL INFO group.cc:64 -> 3 [Async thread]

DisAI-4090-3:63755:63771 [3] misc/socket.cc:30 NCCL WARN socketProgressOpt: Call to recv from 10.102.0.232<49415> failed : Connection reset by peer
DisAI-4090-3:63755:63771 [3] NCCL INFO misc/socket.cc:47 -> 6
DisAI-4090-3:63755:63771 [3] NCCL INFO misc/socket.cc:58 -> 6
DisAI-4090-3:63755:63771 [3] NCCL INFO misc/socket.cc:773 -> 6
DisAI-4090-3:63755:63771 [3] NCCL INFO bootstrap.cc:69 -> 6
DisAI-4090-3:63755:63771 [3] NCCL INFO bootstrap.cc:397 -> 6
DisAI-4090-3:63755:63771 [3] NCCL INFO init.cc:976 -> 6
DisAI-4090-3:63755:63771 [3] NCCL INFO init.cc:1396 -> 6
DisAI-4090-3:63755:63771 [3] NCCL INFO group.cc:64 -> 6 [Async thread]
DisAI-4090-3:63755:63755 [3] NCCL INFO group.cc:418 -> 6
DisAI-4090-3:63755:63755 [3] NCCL INFO group.cc:95 -> 6
DisAI-4090-3: Test NCCL failure common.cu:961 'remote process exited or there was a network error / '
 .. DisAI-4090-3 pid 63755: Test failure common.cu:844
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31551,1],1]
  Exit code:    3
--------------------------------------------------------------------------
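
In a log of this length, the `NCCL WARN` lines are usually the quickest place to start. The trailing `-> 3` and `-> 6` values on the INFO lines appear to be NCCL result codes, which matches the final messages on each node (internal error and remote error respectively). A minimal sketch for pulling out only the warnings from a saved log; `nccl_warnings` is a hypothetical helper name and the sample lines are copied from the log above:

```shell
# Print only the NCCL warning lines from a saved log file.
nccl_warnings() {
  grep -F 'NCCL WARN' "$1"
}

# Demo on two lines taken from the log above.
log="$(mktemp)"
cat > "$log" <<'EOF'
DisAI-4090-2:914454:914478 [0] bootstrap.cc:77 NCCL WARN Message truncated : received 64 bytes instead of 8
DisAI-4090-2:914454:914478 [0] NCCL INFO bootstrap.cc:412 -> 3
EOF
nccl_warnings "$log"
rm -f "$log"
```

Run against the full log, this would leave the "Message truncated" warning on DisAI-4090-2 and the two socket warnings on DisAI-4090-3.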

sjeaugey commented 3 months ago

Hi,

Would you be able to try again with NCCL 2.21 (the latest NCCL release)?

a-c-dream commented 3 months ago

My NCCL version is 2.19.3. I got NCCL by installing PyTorch. Can you tell me how to upgrade it? [image]

sjeaugey commented 3 months ago

It should be as simple as:

git clone https://github.com/nvidia/nccl
cd nccl
make -j # will take a few minutes
export NCCL_HOME=$PWD/build
cd /path/to/nccl-tests
mpirun -x LD_LIBRARY_PATH=$NCCL_HOME/lib:$LD_LIBRARY_PATH <rest of the mpirun command line to run NCCL tests>

a-c-dream commented 3 months ago

In addition, I get the following error if I execute the command on server 232, but no error if I execute it on server 233 (as shown in the first picture). [image] Does that help solve the problem?

sjeaugey commented 3 months ago

Looks like you're missing NCCL on one node. If you recompile it you should make sure it's visible from all nodes (e.g. it's on a shared NFS or copied over to every node at the same location).
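
One way to check this is to look for the library file in the same directory on each node. A minimal sketch, assuming a POSIX shell; `has_nccl` is a hypothetical helper name and `NCCL_LIB_DIR` is a placeholder for wherever NCCL was built or installed:

```shell
# Succeeds if the given directory contains a libnccl shared object.
has_nccl() {
  for f in "$1"/libnccl.so*; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# Placeholder path: adjust to your NCCL build or install location.
NCCL_LIB_DIR="${NCCL_LIB_DIR:-$HOME/nccl/build/lib}"

if has_nccl "$NCCL_LIB_DIR"; then
  echo "libnccl found in $NCCL_LIB_DIR"
else
  echo "libnccl NOT found in $NCCL_LIB_DIR"
fi
```

Running the same check on every node (for example over ssh) should show whether the library sits at the same location everywhere.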

a-c-dream commented 3 months ago

As you can see, both servers have NCCL installed. [image] [image]

a-c-dream commented 3 months ago

@sjeaugey I added -x LD_LIBRARY_PATH=$NCCL_HOME/lib:$LD_LIBRARY_PATH and it worked, but only on a single machine, not across both. Whichever server I launch from, the test runs stand-alone on that server only. This is the command I used:

mpirun -np 2 -pernode \
--allow-run-as-root \
-host 10.102.0.233,10.102.0.232 \
-mca btl_tcp_if_include eno1  \
-x NCCL_SOCKET_IFNAME=eno1  \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH=/home/xxx/anaconda3/envs/new_pytorch/lib/python3.10/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH
./build/all_reduce_perf NCCL_P2P_LEVEL=0 -b 8 -e 128M -f 2 -g 4 -c 0
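
As pasted, the `-x LD_LIBRARY_PATH=...` line has no trailing backslash, so a shell would end the `mpirun` command there and run `./build/all_reduce_perf` as a separate, local command. That may explain the stand-alone behaviour, although the backslash could also have been lost when the command was pasted. In addition, `NCCL_P2P_LEVEL=0` placed after the binary name is passed as a program argument and does not set an environment variable. A sketch of a possible correction, assuming `NCCL_P2P_LEVEL=0` was meant as an environment setting; `NCCL_LIB` and `run_allreduce` are hypothetical names, and `NCCL_LIB` stands in for the library path used above:

```shell
# Dry run by default: prints the command instead of launching it.
# Set RUN to the empty string (RUN= sh script.sh) to actually execute mpirun.
RUN="${RUN-echo}"

# Hypothetical location of the NCCL libraries; must exist on every node.
NCCL_LIB="${NCCL_LIB:-$HOME/nccl/build/lib}"

run_allreduce() {
  # Every line except the last ends with a backslash, so the test
  # binary stays part of the mpirun command line.
  $RUN mpirun -np 2 -pernode \
    --allow-run-as-root \
    -host 10.102.0.233,10.102.0.232 \
    -mca btl_tcp_if_include eno1 \
    -x NCCL_SOCKET_IFNAME=eno1 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_P2P_LEVEL=0 \
    -x LD_LIBRARY_PATH="$NCCL_LIB:${LD_LIBRARY_PATH:-}" \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4 -c 0
}

run_allreduce
```

With one process per node and `-g 4`, a successful launch should list eight ranks across both hosts, as in the first log.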

[image]