Open Assassin187 opened 5 months ago
Can you run again with NCCL_DEBUG=INFO?
Alright, this is the output after I re-executed the command mpirun -np 2 -pernode --allow-run-as-root -hostfile host.txt -mca btl_tcp_if_include 192.168.0.0/24 -x NCCL_SOCKET_IFNAME=enp6s0 ./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 -c 0
with debugging information included.
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 0 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 3838159 on fengyun416 device 0 [0x01] NVIDIA RTX A6000
# Rank 1 Group 0 Pid 3838159 on fengyun416 device 1 [0x07] NVIDIA RTX A6000
# Rank 2 Group 0 Pid 3237691 on fengyun device 0 [0x01] NVIDIA RTX A6000
# Rank 3 Group 0 Pid 3237691 on fengyun device 1 [0x05] NVIDIA RTX A6000
fengyun416:3838159:3838159 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp6s0
fengyun416:3838159:3838159 [0] NCCL INFO Bootstrap : Using enp6s0:192.168.0.204<0>
fengyun416:3838159:3838159 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
fengyun416:3838159:3838159 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
fengyun416:3838159:3838159 [0] NCCL INFO NET/Plugin: Using internal network plugin.
fengyun416:3838159:3838159 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.21.5+cuda12.2
fengyun: Test NCCL failure common.cu:959 'internal error - please report this issue to the NCCL developers / '
.. fengyun pid 3237691: Test failure common.cu:844
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
fengyun416:3838159:3838189 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp6s0
fengyun416:3838159:3838189 [0] NCCL INFO NET/IB : No device found.
fengyun416:3838159:3838189 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp6s0
fengyun416:3838159:3838189 [0] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.0.204<0>
fengyun416:3838159:3838189 [0] NCCL INFO Using non-device net plugin version 0
fengyun416:3838159:3838189 [0] NCCL INFO Using network Socket
fengyun416:3838159:3838190 [1] NCCL INFO Using non-device net plugin version 0
fengyun416:3838159:3838190 [1] NCCL INFO Using network Socket
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[31381,1],1]
Exit code: 3
--------------------------------------------------------------------------
I don't think that log shows the NCCL_DEBUG=INFO output from the failing process?
Did you use -x NCCL_DEBUG=INFO on the mpirun command line?
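For reference, here is a minimal sketch of the command from this thread with the debug variable exported to every rank. Open MPI's -x option forwards an environment variable to the launched processes; all other arguments are copied unchanged from the command posted above. The command is only printed here, so run the printed line (or replace the echo with eval) to launch it.

```shell
#!/bin/sh
# Sketch: the mpirun command from this thread with NCCL_DEBUG=INFO
# forwarded to all ranks via -x, so the failing node logs as well.
CMD="mpirun -np 2 -pernode --allow-run-as-root -hostfile host.txt \
-mca btl_tcp_if_include 192.168.0.0/24 \
-x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=enp6s0 \
./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 -c 0"

# Print instead of executing, so the command can be inspected first.
echo "$CMD"
```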
I have now added the debug option to the command line, but there still seem to be no relevant error messages.
The WARN above suggests that the NCCL_SOCKET_IFNAME you specified is not valid on the fengyun node. If you have to specify a NCCL_SOCKET_IFNAME then it has to be available as that same device name on all nodes.
So you mean that if I want to specify NCCL_SOCKET_IFNAME, the network interface names on both servers must be the same? But the network interface on the other server is named enp4s0. How should I specify this on the command line?
Sorry, different NCCL_SOCKET_IFNAME names are not supported.
But do you need to specify NCCL_SOCKET_IFNAME?
Maybe you could write a wrapper script that sets NCCL_SOCKET_IFNAME based on the hostname?
Or perhaps setting it to the different names in /etc/nccl.conf on each node would work?
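A minimal sketch of the wrapper-script idea, assuming the two hostnames shown in the log (fengyun416 and fengyun) and the interface names mentioned in this thread (enp6s0 and enp4s0). The script name nccl_wrapper.sh and the function name pick_ifname are made up for illustration. Each rank would run the wrapper, which sets NCCL_SOCKET_IFNAME for the local host and then executes the real program.

```shell
#!/bin/sh
# nccl_wrapper.sh (hypothetical name): choose NCCL_SOCKET_IFNAME per host.

# Map a hostname to its network interface. The pairs below are taken
# from this thread; adjust them for other machines.
pick_ifname() {
    case "$1" in
        fengyun416) echo enp6s0 ;;
        fengyun)    echo enp4s0 ;;
        *)          return 1 ;;   # unknown host: let the caller decide
    esac
}

# When invoked with a command, export the variable and run that command.
if [ "$#" -gt 0 ]; then
    NCCL_SOCKET_IFNAME=$(pick_ifname "$(uname -n)") || {
        echo "no interface configured for host $(uname -n)" >&2
        exit 1
    }
    export NCCL_SOCKET_IFNAME
    exec "$@"
fi
```

It would be launched as something like mpirun ... ./nccl_wrapper.sh ./nccl-tests/build/all_reduce_perf ..., with the -x NCCL_SOCKET_IFNAME=... option dropped from the mpirun line so that the per-host value is not overridden. The script would need to exist at the same path on both servers.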
I discovered that there was no /etc/nccl.conf file in the system, so I created this file and set the respective variable NCCL_SOCKET_IFNAME for each host. However, this file does not seem to take effect on the server with the network interface named enp6s0, but it does work on the server with the network interface named enp4s0.
I just searched and found a method in this issue (https://github.com/NVIDIA/nccl/issues/286) that allows specifying multiple network interface names for NCCL_SOCKET_IFNAME. I tried this method, and it worked. Now my test case is running normally.
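A sketch of what that could look like here. NCCL_SOCKET_IFNAME accepts a comma-separated list of interface name prefixes, so both names can be passed at once and each node uses whichever one it has. The exact value used in the working run is not given in this thread, so the list below is an assumption built from the two interface names mentioned above. The command is only printed, not executed.

```shell
#!/bin/sh
# Assumed value: both interface names from this thread, comma-separated.
IFNAMES="enp6s0,enp4s0"

# Same command as before, with the single interface replaced by the list.
CMD="mpirun -np 2 -pernode --allow-run-as-root -hostfile host.txt \
-mca btl_tcp_if_include 192.168.0.0/24 \
-x NCCL_SOCKET_IFNAME=$IFNAMES \
./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 -c 0"

# Print instead of executing, so the command can be inspected first.
echo "$CMD"
```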
I am using the mpirun command to test the all_reduce_perf file of nccl-tests on two servers within the same local area network. I am able to run other files normally with the mpirun command, but when I use the command mpirun -np 2 -pernode --allow-run-as-root -hostfile host.txt -mca btl_tcp_if_include 192.168.0.0/24 -x NCCL_SOCKET_IFNAME=enp6s0 ./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 -c 0 to test this file, the following error occurs: