NVIDIA / nccl-tests

NCCL Tests

2 Nodes nccl-test with mpi hangs #229

Closed: sdonoso closed this issue 1 month ago

sdonoso commented 1 month ago

I compiled nccl-tests with MPI=1 and verified that both nodes have the same version of Open MPI. I built Open MPI with UCX. When I run the following:

rene@puente:~/nccl-tests$ mpirun  -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1 -x NCCL_IB_HCA=mlx5_0:1 --allow-run-as-root --bind-to socket --hostfile hostfile -np 2 --mca pml ucx --map-by ppr:1:node  ./build/all_reduce_perf -b 8 -e 8 -f 2 -g 1

The process hangs.

If I try the following:

rene@puente:~/nccl-tests$ mpirun -hostfile hostfile -np 2 --mca pml ucx  --map-by ppr:1:node ./hello_world
Hello world from rank 1 out of 2 processors

The process hangs after printing the hello world line. Only rank 1 reports; rank 0 never prints.
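To make this kind of check repeatable, here is a minimal sketch of a helper that lists which ranks never reported. `missing_ranks` is a hypothetical name, not part of Open MPI or nccl-tests, and it assumes the hello world program prints exactly the line format shown above.

```shell
#!/bin/sh
# Hypothetical helper: list the ranks that never printed their hello-world line.
# Assumes the output format "Hello world from rank R out of N processors".
missing_ranks() {
    # $1: expected number of ranks; the program output is read from stdin.
    awk -v n="$1" '
        # Field 5 is the rank number in the assumed line format.
        /^Hello world from rank [0-9]+ out of [0-9]+ processors/ { seen[$5] = 1 }
        END {
            # Print every expected rank that was not seen, one per line.
            for (r = 0; r < n; r++)
                if (!(r in seen)) printf "%d\n", r
        }'
}
```

Since a hung mpirun never closes its output, one option is to bound the run with the coreutils `timeout` command, for example `timeout 60 mpirun -hostfile hostfile -np 2 --mca pml ucx --map-by ppr:1:node ./hello_world | missing_ranks 2`. Fed the output shown above, the helper prints 0, which points at the node hosting rank 0 (or the path to it) as the place to look first.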

The Environment

OS: Ubuntu 22.04

mpirun (Open MPI) 5.0.3

MLNX_OFED_LINUX-24.04-0.6.6.0

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

I have two nodes, each with 8 A100 GPUs. With -np 1 it works fine, but that is not a multi-node run. How can I fix it?

kiskra-nvidia commented 1 month ago

You pretty much did your own diagnostics: your MPI installation is not working correctly (the MPI hello world doesn't run as expected). This is not an NCCL issue. Ask the Open MPI community for help if you can't figure out the fix on your own.