Closed: jianh619 closed this issue 2 months ago.
In the Nvidia-provided containers, the MPI=1 built tests are usually called /opt/nccl-tests/build/all_reduce_perf_mpi. Did you build /opt/nccl-tests/build/all_reduce_perf yourself using MPI=1?
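For reference, building the tests with MPI enabled usually looks like the sketch below (the MPI_HOME, CUDA_HOME, and NCCL_HOME paths are assumptions and depend on your system):
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr
# The resulting binaries end up in ./build/, e.g. ./build/all_reduce_perf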
I encountered the same problem on H100 as well. I built nccl-tests with OpenMPI and did not use Docker.
> In the Nvidia-provided containers, the MPI=1 built tests are usually called /opt/nccl-tests/build/all_reduce_perf_mpi. Did you build /opt/nccl-tests/build/all_reduce_perf yourself using MPI=1?
Yes, I built the image myself, compiling nccl-tests with MPI=1.
BTW, is there an official container provided by Nvidia? Could you let me know where I can get the download link?
I solved this issue by using OpenMPI 4.1 instead. I originally built nccl-tests with OpenMPI 5.0, but it ran separately on each node. After switching to OpenMPI 4.1 and rebuilding, it works as expected now.
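In case it helps anyone hitting the same thing, here is a rough sketch of how to confirm which OpenMPI the build and the launcher actually pick up (output formatting differs between installs):
which mpicc mpirun
mpirun --version
ompi_info | grep "Open MPI:"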
> Yes, I built the image myself, compiling nccl-tests with MPI=1.
It sure looks like, for whatever reason, either your MPI compilation or your MPI installation does not work as expected. Does a simple MPI "hello world" type program work correctly (one that reports the rank and size of MPI_COMM_WORLD from each launched process)?
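For example, a minimal check could look like the sketch below (the hostfile name and the 16-process count are assumptions for a two-node, 8-GPU-per-node setup):
cat > mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc mpi_hello.c -o mpi_hello
mpirun -np 16 -hostfile hosts ./mpi_hello
If MPI is healthy, this prints ranks 0 through 15, each reporting a size of 16; if each node prints only ranks 0 through 7, the launcher is not forming a single job across the nodes.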
Can you verify that your all_reduce_perf actually uses MPI? Say, check with ldd whether it links with the MPI library:
ldd all_reduce_perf | grep mpi
Or check with nm whether it has any MPI symbols:
nm all_reduce_perf | grep MPI
> BTW, is there an official container provided by Nvidia? Could you let me know where I can get the download link?
Docker container nvidia/cuda:12.2.2-devel-ubuntu22.04 contains NCCL 2.19.3. A number of containers in Nvidia's NGC catalog (https://catalog.ngc.nvidia.com/) contain NCCL as well. I believe TensorRT does (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt), and PyTorch (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). Also https://catalog.ngc.nvidia.com/orgs/nvidia/containers/hpc-benchmarks, https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc...
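For example, the CUDA devel image mentioned above can be pulled straight from Docker Hub, while the NGC images come from the nvcr.io registry (the PyTorch tag below is only illustrative; check the catalog pages for current tags):
docker pull nvidia/cuda:12.2.2-devel-ubuntu22.04
docker pull nvcr.io/nvidia/pytorch:23.10-py3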
Thanks guys, it must have been some compilation issue. After rebuilding the image, it works now.
I'm running nccl-tests on two H800 nodes, but it seems they run separately on each node.
The test runs in a container, and I also followed the guide to compile with the MPI setting.
Here's the command:
The output is as below:
The rank should increase from 0 to 15, but it repeats from 0 to 7, and the results show two counts for each buffer size.
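For reference, the kind of launch I would expect to produce ranks 0 through 15 looks roughly like the sketch below (not my actual command; the hostnames, slot counts, and binary path are assumptions):
mpirun -np 16 -H node1:8,node2:8 \
    -x NCCL_DEBUG=INFO \
    /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1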
Here's the topo dump file; it looks as expected.
And the trace log; there's a message showing "Failed to find ncclNetPlugin_v8", but I have no idea whether it matters or not.