NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Issue Running NCCL Tests on Gentoo with Varying GPU Availability: CUDA failure common.cu:892 'invalid device ordinal' #171

Closed SweeneyJun closed 10 months ago

SweeneyJun commented 10 months ago

Description: I'm experiencing issues while trying to run NCCL tests on a Gentoo-based system. Here are the details of my setup:

I'm trying to use the mpirun command to run NCCL tests on four physical machines (sniper, slark, clinkz, mirana), each with four NVIDIA TITAN Xp GPUs. However, some of these GPUs are malfunctioning, and their device ordering can change between reboots (the index of a faulty GPU may differ from boot to boot). Specifically, the numbers of working GPUs on the four machines are 3, 2, 3, and 2.
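Since the set of healthy GPUs differs per node and per boot, a quick way to confirm what CUDA actually sees on each host (assuming `nvidia-smi` is installed and passwordless ssh is set up; hostnames match the ones above) is:

```shell
# List the GPUs visible on each host before picking -g / CUDA_VISIBLE_DEVICES.
for h in sniper slark clinkz mirana; do
  echo "== $h =="
  ssh "$h" nvidia-smi -L
done
```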

Here's the command I'm using to run the tests:

/home/myusername/openmpi/bin/mpirun --mca plm_rsh_agent /usr/bin/ssh --mca btl_tcp_if_include custom-L -np 4 --hostfile ./hostMPI -x LD_LIBRARY_PATH -x PATH -x CUDA_VISIBLE_DEVICES -x NCCL_ALGO -x NCCL_BUFFSIZE -x NCCL_CHECK_POINTERS -x NCCL_COMM_BLOCKING -x NCCL_CROSS_NIC -x NCCL_DEBUG -x NCCL_DMABUF_ENABLE -x NCCL_GDR_READ -x NCCL_GRAPH_MIXING_SUPPORT -x NCCL_GRAPH_REGISTER -x NCCL_IGNORE_CPU_AFFINITY /home/myusername/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

The hostMPI file contains the following entries:

192.168.3.9 max-slots=1
192.168.3.11 max-slots=1
192.168.3.12 max-slots=1
192.168.3.14 max-slots=1

When I run the tests, I encounter the following error:

INFO:root:...
clinkz: Test CUDA failure common.cu:892 'invalid device ordinal'
...
sniper: Test CUDA failure common.cu:892 'invalid device ordinal'
...
mirana: Test CUDA failure common.cu:892 'invalid device ordinal'
...
slark: Test CUDA failure common.cu:892 'invalid device ordinal'
...

I attempted to reduce the scale by running tests on only two machines (sniper and slark) with -np 2, but I still encountered the same error.

INFO:root:...
sniper: Test CUDA failure common.cu:892 'invalid device ordinal'
...
slark: Test CUDA failure common.cu:892 'invalid device ordinal'
...

Even when using only one machine (slark), the error persists:

INFO:root:# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
sniper: Test CUDA failure common.cu:892 'invalid device ordinal'
 .. sniper pid 5087: Test failure common.cu:842

INFO:root:--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[1458,1],0]
  Exit code:    2

It seems the error is related to the varying availability of GPUs. I tried using CUDA_VISIBLE_DEVICES to control which GPUs are used, but the output is identical with or without this environment variable set. I would appreciate any guidance on how to resolve this issue and run the NCCL tests successfully in this environment.

AddyLaddy commented 10 months ago

The -g 8 parameter to the all_reduce_perf test instructs it to try to configure 8 GPUs on each node. I don't believe you have 8 GPUs in each node.
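As a rough sketch of the arithmetic (the per-node counts 3, 2, 3, 2 are taken from your description): each MPI rank tries to open -g GPUs on its own node, so a single -g value shared by all ranks can be at most the smallest per-node count.

```shell
# Working GPU counts per node as reported above: sniper=3 slark=2 clinkz=3 mirana=2.
# -g must not exceed the smallest count, or the rank on that node will fail
# with 'invalid device ordinal'.
counts="3 2 3 2"
min=999
for c in $counts; do
  [ "$c" -lt "$min" ] && min=$c
done
echo "largest safe -g value: $min"
np=4
echo "total GPUs used: $((np * min))"
```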

SweeneyJun commented 10 months ago

Thank you for helping me figure out this error 😢. Given that I have machines with malfunctioning GPUs, should I set the -g parameter to 2 (since setting it to 3 fails when only two GPUs are available)?

I tried using the -g 2 parameter and encountered the following error:

/home/myusername/openmpi/bin/mpirun --mca plm_rsh_agent /usr/bin/ssh --mca btl_tcp_if_include custom-L -np 2 --hostfile ./hostMPI -x LD_LIBRARY_PATH -x PATH -x CUDA_VISIBLE_DEVICES -x NCCL_ALGO -x NCCL_BUFFSIZE -x NCCL_CHECK_POINTERS -x NCCL_COMM_BLOCKING -x NCCL_CROSS_NIC -x NCCL_DEBUG -x NCCL_DMABUF_ENABLE -x NCCL_GDR_READ -x NCCL_GRAPH_MIXING_SUPPORT -x NCCL_GRAPH_REGISTER -x NCCL_IGNORE_CPU_AFFINITY /home/myusername/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 307887 on     sniper device  0 [0x02] NVIDIA TITAN Xp
#  Rank  1 Group  0 Pid 307887 on     sniper device  1 [0x82] NVIDIA TITAN Xp
#  Rank  2 Group  0 Pid   6780 on      slark device  0 [0x03] NVIDIA TITAN Xp
#  Rank  3 Group  0 Pid   6780 on      slark device  1 [0x82] NVIDIA TITAN Xp
sniper:307887:307887 [0] NCCL INFO Bootstrap : Using eno1:210.28.133.161<0>
sniper:307887:307887 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
sniper:307887:307887 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
slark:6780:6780 [0] NCCL INFO cudaDriverVersion 12020
slark:6780:6780 [0] NCCL INFO Bootstrap : Using eno1:210.28.133.162<0>
slark:6780:6780 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
slark:6780:6780 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
slark:6780:6780 [0] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
sniper:307887:307887 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.3+cuda12.0
sniper:307887:307887 [0] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
sniper:307887:307907 [0] NCCL INFO Failed to open libibverbs.so[.1]
sniper:307887:307907 [0] NCCL INFO NET/Socket : Using [0]eno1:210.28.133.161<0> [1]custom-L:192.168.3.9<0> [2]wg0:10.8.0.104<0>
sniper:307887:307907 [0] NCCL INFO Using network Socket
sniper:307887:307908 [1] NCCL INFO Using network Socket
slark:6780:6798 [0] NCCL INFO Failed to open libibverbs.so[.1]
slark:6780:6798 [0] NCCL INFO NET/Socket : Using [0]eno1:210.28.133.162<0> [1]enp1s0f0:192.168.3.10<0> [2]custom-L:192.168.3.11<0> [3]wg0:10.8.0.105<0> [4]br-b0a3a0b6e709:172.19.65.1<0> [5]vethd11e498:fe80::88ec:2fff:fe75:10f4%vethd11e498<0> [6]vethf636167:fe80::84a9:63ff:fe4c:9545%vethf636167<0>
slark:6780:6798 [0] NCCL INFO Using network Socket
slark:6780:6799 [1] NCCL INFO Using network Socket
sniper:307887:307907 [0] NCCL INFO NCCL_CHECK_POINTERS set by environment to 0.
sniper:307887:307907 [0] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 1.
slark:6780:6799 [1] NCCL INFO NCCL_CHECK_POINTERS set by environment to 0.
slark:6780:6799 [1] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 1.
sniper:307887:307907 [0] NCCL INFO comm 0x55f66b9f5f90 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 2000 commId 0x88cacecf666d8ef4 - Init START
sniper:307887:307908 [1] NCCL INFO comm 0x55f66ba05fa0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 82000 commId 0x88cacecf666d8ef4 - Init START
slark:6780:6799 [1] NCCL INFO comm 0x55d086d97a90 rank 3 nranks 4 cudaDev 1 nvmlDev 1 busId 82000 commId 0x88cacecf666d8ef4 - Init START
slark:6780:6798 [0] NCCL INFO comm 0x55d086d87ac0 rank 2 nranks 4 cudaDev 0 nvmlDev 0 busId 3000 commId 0x88cacecf666d8ef4 - Init START
sniper:307887:307908 [1] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 0.
sniper:307887:307908 [1] NCCL INFO NCCL_CROSS_NIC set by environment to 2.
sniper:307887:307907 [0] NCCL INFO Setting affinity for GPU 0 to 010001
slark:6780:6798 [0] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 0.
slark:6780:6798 [0] NCCL INFO Setting affinity for GPU 0 to 010001
slark:6780:6799 [1] NCCL INFO NCCL_CROSS_NIC set by environment to 2.
sniper:307887:307907 [0] NCCL INFO Channel 00/04 :    0   1   2   3
sniper:307887:307908 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0
sniper:307887:307907 [0] NCCL INFO Channel 01/04 :    0   1   2   3
sniper:307887:307908 [1] NCCL INFO NCCL_BUFFSIZE set by environment to 4194304.
sniper:307887:307907 [0] NCCL INFO Channel 02/04 :    0   1   2   3
sniper:307887:307908 [1] NCCL INFO P2P Chunksize set to 131072
sniper:307887:307907 [0] NCCL INFO Channel 03/04 :    0   1   2   3
sniper:307887:307907 [0] NCCL INFO Trees [0] 1/2/-1->0->-1 [1] 1/2/-1->0->-1 [2] 1/-1/-1->0->2 [3] 1/-1/-1->0->2
sniper:307887:307907 [0] NCCL INFO P2P Chunksize set to 131072
slark:6780:6799 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2
slark:6780:6799 [1] NCCL INFO NCCL_BUFFSIZE set by environment to 4194304.
slark:6780:6799 [1] NCCL INFO P2P Chunksize set to 131072
slark:6780:6799 [1] NCCL INFO NCCL_GRAPH_MIXING_SUPPORT set by environment to 1.
sniper:307887:307907 [0] NCCL INFO NCCL_GRAPH_MIXING_SUPPORT set by environment to 1.
slark:6780:6798 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/-1/-1->2->0 [2] 3/0/-1->2->-1 [3] 3/0/-1->2->-1
slark:6780:6798 [0] NCCL INFO P2P Chunksize set to 131072
sniper:307887:307907 [0] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [receive] via NET/Socket/1
sniper:307887:307907 [0] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [receive] via NET/Socket/2
sniper:307887:307907 [0] NCCL INFO Channel 02/0 : 3[1] -> 0[0] [receive] via NET/Socket/1
sniper:307887:307907 [0] NCCL INFO Channel 03/0 : 3[1] -> 0[0] [receive] via NET/Socket/2
sniper:307887:307907 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
sniper:307887:307907 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
sniper:307887:307907 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
sniper:307887:307907 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
slark:6780:6799 [1] NCCL INFO Channel 00/0 : 3[1] -> 0[0] [send] via NET/Socket/1
slark:6780:6799 [1] NCCL INFO Channel 01/0 : 3[1] -> 0[0] [send] via NET/Socket/2
slark:6780:6799 [1] NCCL INFO Channel 02/0 : 3[1] -> 0[0] [send] via NET/Socket/1
slark:6780:6799 [1] NCCL INFO Channel 03/0 : 3[1] -> 0[0] [send] via NET/Socket/2
slark:6780:6798 [0] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [receive] via NET/Socket/1
sniper:307887:307908 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/1
sniper:307887:307908 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/2
sniper:307887:307908 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[0] [send] via NET/Socket/1
sniper:307887:307908 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[0] [send] via NET/Socket/2
slark:6780:6798 [0] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [receive] via NET/Socket/2
slark:6780:6798 [0] NCCL INFO Channel 02/0 : 1[1] -> 2[0] [receive] via NET/Socket/1
slark:6780:6798 [0] NCCL INFO Channel 03/0 : 1[1] -> 2[0] [receive] via NET/Socket/2
slark:6780:6798 [0] NCCL INFO Channel 00 : 2[0] -> 3[1] via SHM/direct/direct
slark:6780:6798 [0] NCCL INFO Channel 01 : 2[0] -> 3[1] via SHM/direct/direct
slark:6780:6798 [0] NCCL INFO Channel 02 : 2[0] -> 3[1] via SHM/direct/direct
slark:6780:6798 [0] NCCL INFO Channel 03 : 2[0] -> 3[1] via SHM/direct/direct
sniper:307887:307908 [1] NCCL INFO Connected all rings
sniper:307887:307908 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
sniper:307887:307908 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
sniper:307887:307908 [1] NCCL INFO Channel 02 : 1[1] -> 0[0] via SHM/direct/direct
sniper:307887:307908 [1] NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct
slark:6780:6806 [1] NCCL INFO misc/socket.cc:504 -> 2 (Operation now in progress)
slark:6780:6806 [1] NCCL INFO misc/socket.cc:567 -> 2
slark:6780:6806 [1] NCCL INFO misc/socket.cc:618 -> 2
slark:6780:6806 [1] NCCL INFO transport/net_socket.cc:333 -> 2
slark:6780:6806 [1] NCCL INFO transport/net.cc:592 -> 2
slark:6780:6806 [1] NCCL INFO proxy.cc:1306 -> 2
slark:6780:6806 [1] NCCL INFO proxy.cc:1377 -> 2

slark:6780:6806 [1] proxy.cc:1519 NCCL WARN [Proxy Service 3] Failed to execute operation Connect from rank 3, retcode 2

slark:6780:6799 [1] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer slark<52087>
slark:6780:6799 [1] NCCL INFO misc/socket.cc:749 -> 6

slark:6780:6799 [1] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f3c80e2d180
slark:6780:6799 [1] NCCL INFO transport/net.cc:288 -> 3
slark:6780:6799 [1] NCCL INFO transport.cc:148 -> 3
slark:6780:6798 [0] NCCL INFO Connected all rings
slark:6780:6799 [1] NCCL INFO init.cc:1079 -> 3
slark:6780:6799 [1] NCCL INFO init.cc:1358 -> 3
slark:6780:6799 [1] NCCL INFO group.cc:65 -> 3 [Async thread]
slark:6780:6798 [0] NCCL INFO misc/socket.cc:46 -> 3
slark:6780:6798 [0] NCCL INFO misc/socket.cc:57 -> 3
slark:6780:6807 [0] NCCL INFO misc/socket.cc:805 -> 3
slark:6780:6798 [0] NCCL INFO misc/socket.cc:772 -> 3
slark:6780:6798 [0] NCCL INFO proxy.cc:1107 -> 3
slark:6780:6798 [0] NCCL INFO proxy.cc:1193 -> 3
slark:6780:6798 [0] NCCL INFO transport/net.cc:226 -> 3
slark:6780:6798 [0] NCCL INFO transport.cc:33 -> 3
slark:6780:6798 [0] NCCL INFO transport.cc:97 -> 3

slark:6780:6807 [0] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 0, res=3, closed=0

slark:6780:6807 [0] proxy.cc:1519 NCCL WARN [Proxy Service 2] Failed to execute operation Setup from rank 2, retcode 3
slark:6780:6798 [0] NCCL INFO init.cc:1089 -> 3
slark:6780:6798 [0] NCCL INFO init.cc:1358 -> 3
slark:6780:6798 [0] NCCL INFO group.cc:65 -> 3 [Async thread]
slark:6780:6780 [1] NCCL INFO group.cc:406 -> 3
slark:6780:6780 [1] NCCL INFO group.cc:96 -> 3
slark: Test NCCL failure common.cu:958 'internal error - please report this issue to the NCCL developers / '
 .. slark pid 6780: Test failure common.cu:842

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[42182,1],1]
  Exit code:    3
--------------------------------------------------------------------------
SweeneyJun commented 10 months ago

I noticed the following lines in the log, and I suspect the internal error I encountered might be related to them:

sniper:307887:307907 [0] NCCL INFO NET/Socket : Using [0]eno1:210.28.133.161<0> [1]custom-L:192.168.3.9<0> [2]wg0:10.8.0.104<0>
slark:6780:6798 [0] NCCL INFO NET/Socket : Using [0]eno1:210.28.133.162<0> [1]enp1s0f0:192.168.3.10<0> [2]custom-L:192.168.3.11<0> [3]wg0:10.8.0.105<0> [4]br-b0a3a0b6e709:172.19.65.1<0> [5]vethd11e498:fe80::88ec:2fff:fe75:10f4%vethd11e498<0> [6]vethf636167:fe80::84a9:63ff:fe4c:9545%vethf636167<0>
slark:6780:6798 [0] NCCL INFO Using network Socket

These lines indicate that NCCL may try to use multiple network interfaces for discovery, channel establishment, and subsequent communication. In my scenario, I want the traffic to go through the "custom-L" interface; both machines' "custom-L" interfaces are connected to the same switch on the same subnet.

I tried setting NCCL_SOCKET_IFNAME==custom-L (the value =custom-L, where the leading = tells NCCL to match only an interface named exactly "custom-L", as described in the NCCL environment variable documentation), and the tests then ran successfully without the internal error.
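For anyone hitting the same problem on multi-homed hosts, the fix described above can be applied like this (the interface name `custom-L` is specific to this cluster; substitute your own):

```shell
# Pin NCCL's socket transport to one interface. The leading '=' makes NCCL
# match the name exactly instead of treating it as a prefix.
export NCCL_SOCKET_IFNAME==custom-L
# Remember to forward it to the remote ranks through mpirun as well, e.g.:
# mpirun ... -x NCCL_SOCKET_IFNAME ... all_reduce_perf -b 8 -e 128M -f 2 -g 2
```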