Open heya5 opened 5 days ago
Normally we use mpirun to launch one process per GPU, so you don't need -g 8 on the nccl-tests command line in that case.
@AddyLaddy Thanks!
After changing -g 8 to -g 1, I get the error:
g0016:3715149:3715349 [0] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.16<48059> with error 4, opcode 0, len 0, vendor err 81 (Send)
g0016:3715149:3715349 [0] NCCL INFO transport/net.cc:1008 -> 6
g0016:3715149:3715349 [0] NCCL INFO proxy.cc:679 -> 6
g0016:3715149:3715349 [0] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
g0016:3715153:3715355 [4] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.16<55431> with error 4, opcode 0, len 0, vendor err 81 (Send)
g0016:3715153:3715355 [4] NCCL INFO transport/net.cc:1008 -> 6
g0016:3715153:3715355 [4] NCCL INFO proxy.cc:679 -> 6
g0016:3715153:3715355 [4] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
I found an issue that may help solve the problem: https://github.com/NVIDIA/nccl/issues/928
So I added -x NCCL_NET_GDR_LEVEL=0 to the command:
mpirun --host g0010:8,g0016:8 \
-x LD_LIBRARY_PATH=${NVHPC_ROOT}/comm_libs/nccl/lib:${NVHPC_ROOT}/cuda/lib64 \
-x PATH=${NVHPC_ROOT}/comm_libs/mpi/bin:${NVHPC_ROOT}/compilers/bin:$PATH \
-x NCCL_SOCKET_IFNAME=ib0 \
-x NCCL_DEBUG=INFO \
-x NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH \
-x NCCL_PXN_DISABLE=0 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_ALGO=Ring \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9 \
-x NCCL_NET_GDR_LEVEL=0 \
/home/clouduser/nccl-tests/build/all_reduce_perf -b 128M -e 2G -f 2 -g 1
And I get a new error:
g0016:3715926:3716128 [3] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.16<54520> with error 12, opcode 0, len 5350, vendor err 129 (Recv)
g0016:3715926:3716128 [3] NCCL INFO transport/net.cc:1134 -> 6
g0016:3715926:3716128 [3] NCCL INFO proxy.cc:679 -> 6
g0016:3715926:3716128 [3] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
g0010:958137:958336 [3] transport/net_ib.cc:1297 NCCL WARN NET/IB : Got completion from peer 10.26.10.10<51872> with error 12, opcode 0, len 5413, vendor err 129 (Recv)
g0010:958137:958336 [3] NCCL INFO transport/net.cc:1134 -> 6
g0010:958137:958336 [3] NCCL INFO proxy.cc:679 -> 6
g0010:958137:958336 [3] NCCL INFO proxy.cc:858 -> 6 [Proxy Thread]
I also tried adding -x NCCL_IB_GID_INDEX=3, but I still get error 12.
Yeah, the first NET/IB error looks like the typical one when ACS is not disabled and GDRDMA is used. The second one looks like a typical connection timeout issue when the nodes cannot communicate via the NET/IB device(s).
I'd suggest resolving the ACS issue and also using the perftests suite to check that each node can communicate successfully over the NET/IB devices, using something like ib_write_bw or similar.
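For reference, a rough sketch of those two checks as small shell helpers. The function names are made up for illustration; the host name g0010 and device name mlx5_0 in the usage note come from the command earlier in this thread, and this assumes lspci and the perftest tools are installed on both nodes.

```shell
# Sketch only: helper names are hypothetical, not part of NCCL or perftest.

# Show the PCIe ACS control state (usually needs root).
# Lines containing "SrcValid+" suggest ACS may still be enabled.
check_acs() {
  lspci -vvv | grep -i "ACSCtl"
}

# Start an ib_write_bw server on the given IB device and wait for a client.
ib_server() {
  ib_write_bw -d "$1"
}

# Run the ib_write_bw client on the given IB device against a server host.
ib_client() {
  ib_write_bw -d "$1" "$2"
}
```

Usage would be something like running `ib_server mlx5_0` on g0010 and then `ib_client mlx5_0 g0010` on g0016, repeating for each HCA listed in NCCL_IB_HCA. If a pair fails here as well, the problem is likely below NCCL.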
Also be careful with NCCL_IB_HCA=mlx5_1, as that will select all NICs with that prefix, so mlx5_10, mlx5_11, etc. If those exist on this platform, it may not have been your intention for NCCL to select them. You can instead use NCCL_IB_HCA==mlx5_1 etc.
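To make the prefix vs. exact-match difference concrete, here is a small shell sketch that only emulates the selection rule described above for a single name. It does not call NCCL; the device list and the match_hca function are hypothetical.

```shell
# Hypothetical device list; on a real node, check the names under
# /sys/class/infiniband instead.
devices="mlx5_0 mlx5_1 mlx5_10 mlx5_11"

# match_hca SPEC: print the devices that a single-name spec would select.
# A leading "=" means exact match; otherwise SPEC is treated as a prefix.
match_hca() {
  spec="$1"
  out=""
  for dev in $devices; do
    case "$spec" in
      =*) if [ "$dev" = "${spec#=}" ]; then out="$out $dev"; fi ;;
      *)  case "$dev" in "$spec"*) out="$out $dev" ;; esac ;;
    esac
  done
  echo "${out# }"
}

match_hca "mlx5_1"    # prints: mlx5_1 mlx5_10 mlx5_11
match_hca "=mlx5_1"   # prints: mlx5_1
```

With mpirun, the exact-match form of the list used earlier would presumably be passed as `-x NCCL_IB_HCA==mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_6,mlx5_7,mlx5_8,mlx5_9`, with the leading "=" applying to the whole list; it is worth confirming this against the NCCL environment variable documentation for your version.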
I compile nccl-tests with the command:
And run the command to test all_reduce_perf:
And I got the error:
NOTE: When I use openmpi directly instead of nvhpc, the test runs successfully.