Closed by jiangxiaobin96 1 month ago
Does the following example clarify things for you?
If you are asking how to invoke mpirun with this example, that is cluster-dependent. In general, though, you should launch with the number of processes per node equal to the number of GPUs per node. You don't need to set any per-process CUDA_VISIBLE_DEVICES variables, etc.; in fact, it's better if you don't (the example above figures out the device assignment on its own).
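The "figures it out on its own" behavior typically comes down to each rank selecting the GPU whose index equals its local rank on its node (the NCCL documentation example derives this by gathering hostnames across ranks). A hypothetical pure-Python sketch of that local-rank computation — the `hostnames` list stands in for what an MPI_Allgather of gethostname() results would provide:

```python
def local_rank(rank, hostnames):
    """Local rank = number of lower-numbered ranks on the same host.

    hostnames[i] is the hostname of global rank i, as each process
    would learn via an MPI_Allgather of its gethostname() result.
    The process then binds to that GPU, e.g. cudaSetDevice(local_rank).
    """
    me = hostnames[rank]
    return sum(1 for r in range(rank) if hostnames[r] == me)

# Two nodes with 4 GPUs each: ranks 0-3 on nodeA, ranks 4-7 on nodeB.
hosts = ["nodeA"] * 4 + ["nodeB"] * 4
print([local_rank(r, hosts) for r in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

With this assignment, every process drives exactly one GPU and no CUDA_VISIBLE_DEVICES juggling is needed.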
Thanks for your reply. I have followed this user guide. However, I want to run a benchmark such as nccl-tests. Does nccl-tests support One Device per Process, and how do I run it?
nccl-tests supports one process per GPU when compiled with MPI=1 and run with mpirun or the SLURM equivalent. MPI is needed both to launch the processes and for the MPI_Bcast/MPI_Gather/MPI_Barrier/MPI_Allreduce calls within the benchmarking code.
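The reason MPI is needed inside the benchmark, not just for launching, is the NCCL bootstrap: rank 0 creates a ncclUniqueId, and every other rank must receive the exact same bytes (via MPI_Bcast) before all ranks call ncclCommInitRank. A hypothetical Python mock of that handshake — real code would use the MPI and NCCL C APIs, and `os.urandom` merely stands in for ncclGetUniqueId():

```python
import os

def bootstrap(n_ranks):
    """Mimic the nccl-tests startup handshake.

    Rank 0 generates a unique id (ncclGetUniqueId in real code);
    a broadcast (MPI_Bcast) then leaves every rank holding a copy
    of rank 0's id, so all can call ncclCommInitRank together.
    """
    unique_id = os.urandom(16)              # rank 0's ncclUniqueId stand-in
    return [unique_id for _ in range(n_ranks)]  # post-broadcast state

ids = bootstrap(8)
assert all(i == ids[0] for i in ids)  # all ranks share rank 0's id
```

Without a launcher-provided communication layer like MPI, the processes would have no way to agree on this id before the NCCL communicator exists.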
For example, to run an 8-process job on a single node, I would typically run this using OpenMPI:
mpirun -np 8 -N 8 --oversubscribe --bind-to none -x LD_LIBRARY_PATH -x NCCL_DEBUG=WARN --mca btl tcp,self ./build/all_reduce_perf -b8 -e16G -f2 -g1 -t1
Thanks, I have understood.
How do I support One Device per Process when using mpirun?