NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Test NCCL failure common.cu:959 'internal error - please report this issue to the NCCL developers / ' #219

Open Assassin187 opened 2 months ago

Assassin187 commented 2 months ago

I am using the mpirun command to test the all_reduce_perf binary of nccl-tests on two servers within the same local area network. I am able to run other programs normally with mpirun, but when I use the following command to test this one:

mpirun -np 2 -pernode --allow-run-as-root -hostfile host.txt -mca btl_tcp_if_include 192.168.0.0/24 -x NCCL_SOCKET_IFNAME=enp6s0 ./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 -c 0

the following error occurs:

# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 0 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3740409 on fengyun416 device  0 [0x01] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid 3740409 on fengyun416 device  1 [0x07] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid 3164639 on    fengyun device  0 [0x01] NVIDIA RTX A6000
#  Rank  3 Group  0 Pid 3164639 on    fengyun device  1 [0x05] NVIDIA RTX A6000
fengyun: Test NCCL failure common.cu:959 'internal error - please report this issue to the NCCL developers / '
 .. fengyun pid 3164639: Test failure common.cu:844
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[63721,1],1]
  Exit code:    3
--------------------------------------------------------------------------
sjeaugey commented 2 months ago

Can you run again with NCCL_DEBUG=INFO?

Assassin187 commented 2 months ago

Can you run again with NCCL_DEBUG=INFO?

Alright, this is the output, with debugging information included, after I re-executed the command:

mpirun -np 2 -pernode --allow-run-as-root -hostfile host.txt -mca btl_tcp_if_include 192.168.0.0/24 -x NCCL_SOCKET_IFNAME=enp6s0 ./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 -c 0

# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 0 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 3838159 on fengyun416 device  0 [0x01] NVIDIA RTX A6000
#  Rank  1 Group  0 Pid 3838159 on fengyun416 device  1 [0x07] NVIDIA RTX A6000
#  Rank  2 Group  0 Pid 3237691 on    fengyun device  0 [0x01] NVIDIA RTX A6000
#  Rank  3 Group  0 Pid 3237691 on    fengyun device  1 [0x05] NVIDIA RTX A6000
fengyun416:3838159:3838159 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp6s0
fengyun416:3838159:3838159 [0] NCCL INFO Bootstrap : Using enp6s0:192.168.0.204<0>
fengyun416:3838159:3838159 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
fengyun416:3838159:3838159 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
fengyun416:3838159:3838159 [0] NCCL INFO NET/Plugin: Using internal network plugin.
fengyun416:3838159:3838159 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda12.2
fengyun: Test NCCL failure common.cu:959 'internal error - please report this issue to the NCCL developers / '
 .. fengyun pid 3237691: Test failure common.cu:844
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
fengyun416:3838159:3838189 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp6s0
fengyun416:3838159:3838189 [0] NCCL INFO NET/IB : No device found.
fengyun416:3838159:3838189 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp6s0
fengyun416:3838159:3838189 [0] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.0.204<0>
fengyun416:3838159:3838189 [0] NCCL INFO Using non-device net plugin version 0
fengyun416:3838159:3838189 [0] NCCL INFO Using network Socket
fengyun416:3838159:3838190 [1] NCCL INFO Using non-device net plugin version 0
fengyun416:3838159:3838190 [1] NCCL INFO Using network Socket
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[31381,1],1]
  Exit code:    3
--------------------------------------------------------------------------
AddyLaddy commented 2 months ago

I don't think that log shows the NCCL_DEBUG=INFO from the failing process? Did you use -x NCCL_DEBUG=INFO on the mpirun command line?

Assassin187 commented 2 months ago

I have now included the debugging option on the command line, but there still seem to be no relevant error messages. [screenshot of the output]

AddyLaddy commented 2 months ago

The WARN above suggests that the NCCL_SOCKET_IFNAME you specified is not valid on the fengyun node. If you have to specify a NCCL_SOCKET_IFNAME then it has to be available as that same device name on all nodes.
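
A quick way to confirm which of the two interface names exists on a given node is to look under /sys/class/net. The following is a minimal sketch, assuming Linux nodes; the helper name has_iface is hypothetical, and the interface names are the ones mentioned in this thread.

```shell
#!/bin/sh
# has_iface NAME: succeed if an interface with exactly this name exists
# on the local node (Linux only, relies on /sys/class/net).
has_iface() {
  [ -e "/sys/class/net/$1" ]
}

# Report, for each candidate name from this thread, whether this node has it.
for name in enp6s0 enp4s0; do
  if has_iface "$name"; then
    echo "$(uname -n): $name present"
  else
    echo "$(uname -n): $name missing"
  fi
done
```

Running this on each server should show which name is valid where. A name that is reported missing on a node cannot be used there as NCCL_SOCKET_IFNAME.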

Assassin187 commented 2 months ago

So you mean that if I want to specify NCCL_SOCKET_IFNAME, the network interface names on both servers should be the same? But the network interface name on the other server is enp4s0. How should I specify this on the command line?

AddyLaddy commented 2 months ago

Sorry, different NCCL_SOCKET_IFNAME names are not supported. But do you need to specify NCCL_SOCKET_IFNAME?

Maybe you could write a wrapper script that sets NCCL_SOCKET_IFNAME based on the hostname? Or perhaps setting it to the different names in /etc/nccl.conf on each node would work?
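
The wrapper-script idea could look roughly like the sketch below. It is only an illustration: the script name nccl_wrapper.sh and the function pick_ifname are hypothetical, and the hostname-to-interface mapping is inferred from the logs and comments in this thread (enp6s0 on fengyun416, enp4s0 on fengyun).

```shell
#!/bin/sh
# nccl_wrapper.sh (hypothetical name): set NCCL_SOCKET_IFNAME for the
# local host, then run whatever command was passed as arguments.

# Map a short hostname to its interface name; prints nothing for
# unknown hosts so that NCCL falls back to its own interface selection.
pick_ifname() {
  case "$1" in
    fengyun416) echo "enp6s0" ;;
    fengyun)    echo "enp4s0" ;;
    *)          echo "" ;;
  esac
}

host="$(uname -n)"
host="${host%%.*}"   # strip any domain suffix

ifname="$(pick_ifname "$host")"
if [ -n "$ifname" ]; then
  export NCCL_SOCKET_IFNAME="$ifname"
fi

# Replace this shell with the wrapped program, if one was given.
if [ "$#" -gt 0 ]; then
  exec "$@"
fi
```

With a script like this, mpirun would launch the wrapper followed by ./nccl-tests/build/all_reduce_perf and its arguments, and the -x NCCL_SOCKET_IFNAME=enp6s0 option would be dropped so that each node picks its own value.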

Assassin187 commented 2 months ago

I discovered that there was no /etc/nccl.conf file on the system, so I created this file and set NCCL_SOCKET_IFNAME to the respective interface name on each host. However, this file does not seem to take effect on the server with the network interface named enp6s0, but it does work on the server with the network interface named enp4s0. [screenshot of the output]

Assassin187 commented 2 months ago

I just searched and found a method in this issue (https://github.com/NVIDIA/nccl/issues/286) that allows specifying multiple network interface names for NCCL_SOCKET_IFNAME. I tried this method, and it worked. Now my test case is running normally.
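
The exact value used is not given above. The sketch below assumes it was a comma-separated list of both interface names, which matches how the NCCL documentation describes NCCL_SOCKET_IFNAME (multiple interface prefixes separated by commas). The variable names NCCL_IFNAMES and launch_cmd are hypothetical; the rest of the command is copied from the original post.

```shell
#!/bin/sh
# Interface names from this thread: enp6s0 on fengyun416, enp4s0 on fengyun.
# Assumption: NCCL treats the value as a comma-separated list and each
# node uses whichever listed interface it actually has.
NCCL_IFNAMES="enp6s0,enp4s0"

# Build the launch command; only the -x value differs from the original post.
launch_cmd="mpirun -np 2 -pernode --allow-run-as-root -hostfile host.txt -mca btl_tcp_if_include 192.168.0.0/24 -x NCCL_SOCKET_IFNAME=${NCCL_IFNAMES} ./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 -c 0"

# Print the command for inspection; to launch it, run: eval "$launch_cmd"
echo "$launch_cmd"
```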