NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Test CUDA failure common.cu:941 'invalid device ordinal' when testing two nodes with nvhpc #263

Open heya5 opened 5 days ago

heya5 commented 5 days ago

I compile nccl-tests with the command:

make MPI=1 MPI_HOME=${NVHPC_ROOT}/comm_libs/12.4/hpcx/hpcx-2.19/ompi NCCL_HOME=${NVHPC_ROOT}/comm_libs/nccl CUDA_HOME=${NVHPC_ROOT}/cuda

Then I run all_reduce_perf with this command:

mpirun --host g0010:8,g0016:8 \
-x LD_LIBRARY_PATH=${NVHPC_ROOT}/comm_libs/nccl/lib:${NVHPC_ROOT}/cuda/lib64 \
-x PATH=${NVHPC_ROOT}/comm_libs/mpi/bin:${NVHPC_ROOT}/compilers/bin:$PATH \
-x NCCL_SOCKET_IFNAME=ib0 \
-x NCCL_DEBUG=INFO \
-x NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH \
-x NCCL_PXN_DISABLE=0 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_ALGO=Ring \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 \
/home/clouduser/nccl-tests/build/all_reduce_perf -b 128M -e 2G -f 2 -g 8

And I got the error:

# Using devices
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888386: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888385: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888382: Test failure common.cu:891
g0016: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0016 pid 3645592: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888383: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888387: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888384: Test failure common.cu:891
g0010: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0010 pid 888388: Test failure common.cu:891
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
g0016: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0016 pid 3645590: Test failure common.cu:891
g0016: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0016 pid 3645594: Test failure common.cu:891
g0016: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0016 pid 3645588: Test failure common.cu:891
g0016: Test CUDA failure common.cu:941 'invalid device ordinal'
 .. g0016 pid 3645593: Test failure common.cu:891
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[21831,1],5]
  Exit code:    2

NOTE: When I use openmpi directly instead of nvhpc, the test runs successfully.

AddyLaddy commented 5 days ago

Normally we use mpirun to launch one process per GPU, so you don't need -g 8 on the nccl-tests command line in that case.
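
To illustrate why -g 8 fails on a launch that already starts eight processes per node, here is a minimal sketch. It assumes nccl-tests numbers devices as local rank × GPUs per process + index, with one thread per process; the function names are hypothetical and this is not code from nccl-tests. Under that assumption only local rank 0 asks for ordinals that exist on an 8-GPU node, which is consistent with seven failing pids on g0010 in the log above.

```shell
#!/bin/sh
# Illustrative only. ASSUMPTION: each process requests the CUDA ordinals
#   local_rank * gpus_per_process + i,  for i in 0 .. gpus_per_process-1
# (one thread per process). Function names are hypothetical.

# Print the ordinals one process would request, space separated.
device_ordinals() {
    local_rank=$1
    gpus_per_process=$2
    i=0
    out=""
    while [ "$i" -lt "$gpus_per_process" ]; do
        out="$out$((local_rank * gpus_per_process + i)) "
        i=$((i + 1))
    done
    # Trim the trailing space before printing.
    echo "${out% }"
}

# Print the requested ordinals that do not exist on a node with
# gpus_on_node GPUs (valid ordinals are 0 .. gpus_on_node-1).
invalid_ordinals() {
    gpus_on_node=$3
    bad=""
    for d in $(device_ordinals "$1" "$2"); do
        if [ "$d" -ge "$gpus_on_node" ]; then
            bad="$bad$d "
        fi
    done
    echo "${bad% }"
}

# Local rank 1 with -g 8 on an 8-GPU node: every requested ordinal is invalid.
echo "rank 1, -g 8: invalid ordinals: $(invalid_ordinals 1 8 8)"
# Local rank 7 with -g 1 on the same node: nothing is out of range.
echo "rank 7, -g 1: invalid ordinals: $(invalid_ordinals 7 1 8)"
```

With eight slots per host, -g 1 (or omitting -g) keeps every local rank inside the 0 to 7 range.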

heya5 commented 4 days ago

@AddyLaddy Thanks! After changing -g 8 to -g 1, I get this error:

g0016:3715149:3715349 [0] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.16<48059> with error 4, opcode 0, len 0, vendor err 81 (Send)
g0016:3715149:3715349 [0] NCCL INFO transport/net.cc:1008 -> 6
g0016:3715149:3715349 [0] NCCL INFO proxy.cc:679 -> 6
g0016:3715149:3715349 [0] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

g0016:3715153:3715355 [4] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.16<55431> with error 4, opcode 0, len 0, vendor err 81 (Send)
g0016:3715153:3715355 [4] NCCL INFO transport/net.cc:1008 -> 6
g0016:3715153:3715355 [4] NCCL INFO proxy.cc:679 -> 6
g0016:3715153:3715355 [4] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

I found an issue that may help solve the problem, https://github.com/NVIDIA/nccl/issues/928, so I added -x NCCL_NET_GDR_LEVEL=0 to the command:

mpirun --host g0010:8,g0016:8 \
-x LD_LIBRARY_PATH=${NVHPC_ROOT}/comm_libs/nccl/lib:${NVHPC_ROOT}/cuda/lib64 \
-x PATH=${NVHPC_ROOT}/comm_libs/mpi/bin:${NVHPC_ROOT}/compilers/bin:$PATH \
-x NCCL_SOCKET_IFNAME=ib0 \
-x NCCL_DEBUG=INFO \
-x NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH \
-x NCCL_PXN_DISABLE=0 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_ALGO=Ring \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 \
-x NCCL_NET_GDR_LEVEL=0 \
/home/clouduser/nccl-tests/build/all_reduce_perf -b 128M -e 2G -f 2 -g 1

And I get a new error:

g0016:3715926:3716128 [3] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.16<54520> with error 12, opcode 0, len 5350, vendor err 129 (Recv)
g0016:3715926:3716128 [3] NCCL INFO transport/net.cc:1134 -> 6
g0016:3715926:3716128 [3] NCCL INFO proxy.cc:679 -> 6
g0016:3715926:3716128 [3] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

g0010:958137:958336 [3] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.10<51872> with error 12, opcode 0, len 5413, vendor err 129 (Recv)
g0010:958137:958336 [3] NCCL INFO transport/net.cc:1134 -> 6
g0010:958137:958336 [3] NCCL INFO proxy.cc:679 -> 6
g0010:958137:958336 [3] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]

I also tried adding -x NCCL_IB_GID_INDEX=3, but I still get error 12.

AddyLaddy commented 4 days ago

Yeah, the first NET/IB error looks like the typical one when ACS is not disabled and GDRDMA is used. The second one looks like a typical connection timeout issue when the nodes cannot communicate via the NET/IB device(s).

I'd suggest resolving the ACS issue and also using the perftests suite to check that each node can communicate successfully over the NET/IB devices using something like ib_write_bw or similar.
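
As a rough sketch of those two checks: the helper below counts ACSCtl lines with source validation enabled in `lspci -vvv` style text, which is a common way to spot ACS being active on PCI bridges. The function name is hypothetical, and the commented commands are assumptions about this cluster (device name mlx5_0 and host g0010 are taken from the commands above, and g0010 is assumed to resolve from g0016); adjust them to the actual setup.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vvv` style text on stdin and print how
# many ACSCtl lines show "SrcValid+", a sign that ACS is enabled on that
# bridge. Lines such as ACSCap are ignored on purpose.
count_acs_enabled() {
    grep 'ACSCtl:' | grep -c 'SrcValid+' || true
}

# On each node (needs root to see the capability fields):
#   sudo lspci -vvv | count_acs_enabled
# A result of 0 suggests ACS is disabled everywhere it was reported.

# Point-to-point check with the perftest tools, one HCA at a time:
#   on g0010 (server):  ib_write_bw -d mlx5_0
#   on g0016 (client):  ib_write_bw -d mlx5_0 g0010
```

Repeating the ib_write_bw pair for each HCA listed in NCCL_IB_HCA shows which devices can actually reach the other node.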

Also be careful with NCCL_IB_HCA=mlx5_1, as that will select all NICs with that prefix, including mlx5_10, mlx5_11, etc. If those exist on this platform, it may not have been your intention for NCCL to select them. You can instead use NCCL_IB_HCA==mlx5_1 etc., where the extra = requests an exact match.
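
The prefix-versus-exact behaviour described here can be emulated with a small sketch. The function `hca_selects` is hypothetical and only mimics the matching rule as stated in this comment (a leading = on the value makes every name in the list an exact match, otherwise each name is a prefix); it is not NCCL code and it ignores port suffixes and the ^ exclusion form.

```shell
#!/bin/sh
# Hypothetical emulation of the NCCL_IB_HCA name matching rule.
# $1 is the value given to NCCL_IB_HCA, $2 is a device name to test.
# Returns 0 if the device would be selected under this simplified rule.
hca_selects() {
    value=$1
    dev=$2
    exact=0
    # A leading "=" switches the whole list to exact matching.
    case "$value" in
        =*) exact=1; value=${value#=} ;;
    esac
    # Split the comma-separated list into positional parameters.
    old_ifs=$IFS
    IFS=,
    set -- $value
    IFS=$old_ifs
    for name in "$@"; do
        if [ "$exact" -eq 1 ]; then
            [ "$dev" = "$name" ] && return 0
        else
            case "$dev" in
                "$name"*) return 0 ;;
            esac
        fi
    done
    return 1
}

# NCCL_IB_HCA=mlx5_1 (prefix) versus NCCL_IB_HCA==mlx5_1 (exact):
for spec in "mlx5_1" "=mlx5_1"; do
    for dev in mlx5_1 mlx5_10; do
        if hca_selects "$spec" "$dev"; then
            echo "NCCL_IB_HCA=$spec selects $dev"
        else
            echo "NCCL_IB_HCA=$spec skips $dev"
        fi
    done
done
```

Under this rule the list used in the commands above would also pick up mlx5_10 and similar devices if they exist, which is why the exact form is the safer choice.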