NVIDIA / nccl-tests

How to support One Device per Process? #221

Closed jiangxiaobin96 closed 1 month ago

jiangxiaobin96 commented 2 months ago

How can I use one device per process when launching with mpirun?

kiskra-nvidia commented 2 months ago

Does the following example clarify things for you?

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-2-one-device-per-process-or-thread

If you are asking how to invoke mpirun with this example, that is cluster-dependent. In general, though, you should launch with the number of processes per node equal to the number of GPUs per node. You don't need to set per-process CUDA_VISIBLE_DEVICES variables, etc.; in fact, it's better if you don't, since the example above selects the right device on its own.
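
For reference, a condensed sketch of the initialization pattern from that example (error checking is omitted; the getHostHash helper follows the user guide, and the collective call itself is left as a placeholder):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Hash the hostname so ranks on the same node can recognize each other.
static uint64_t getHostHash(const char* s) {
  uint64_t h = 5381;
  for (int i = 0; s[i] != '\0'; i++) h = ((h << 5) + h) ^ s[i];
  return h;
}

int main(int argc, char* argv[]) {
  int rank, nRanks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

  // Derive a per-node local rank so each process picks a distinct GPU.
  char hostname[1024];
  gethostname(hostname, sizeof(hostname));
  uint64_t hash = getHostHash(hostname);
  uint64_t* hashes = (uint64_t*)malloc(nRanks * sizeof(uint64_t));
  MPI_Allgather(&hash, 1, MPI_UINT64_T, hashes, 1, MPI_UINT64_T, MPI_COMM_WORLD);
  int localRank = 0;
  for (int i = 0; i < rank; i++) if (hashes[i] == hash) localRank++;
  free(hashes);

  cudaSetDevice(localRank);  // no CUDA_VISIBLE_DEVICES needed

  // Rank 0 creates the NCCL unique ID; MPI broadcasts it to everyone.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  ncclCommInitRank(&comm, nRanks, id, rank);

  // ... allocate device buffers and run collectives, e.g. ncclAllReduce ...

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}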

jiangxiaobin96 commented 2 months ago

Thanks for your reply. I have followed that user guide. However, I now want to run a benchmark like nccl-tests. Does nccl-tests support one device per process, and how do I run it that way?

AddyLaddy commented 2 months ago

nccl-tests supports one process per GPU when compiled with MPI=1 and run with mpirun or the SLURM equivalent. MPI is needed both to launch the processes and for the MPI_Bcast/MPI_Gather/MPI_Barrier/MPI_Allreduce calls inside the benchmarking code. For example, to run an 8-process job on a single node, I would typically use OpenMPI like this:

mpirun -np 8 -N 8 --oversubscribe --bind-to none -x LD_LIBRARY_PATH -x NCCL_DEBUG=WARN --mca btl tcp,self ./build/all_reduce_perf -b8 -e16G -f2 -g1 -t1
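
Here, -b8 -e16G -f2 sweeps message sizes from 8 bytes to 16 GB, doubling at each step, while -g1 -t1 gives each process a single GPU and a single thread, which is exactly the one-device-per-process layout. -np 8 -N 8 asks OpenMPI for 8 processes, all on one node.
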
jiangxiaobin96 commented 1 month ago

Thanks, I understand now.