NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Performance Degradation in Multi-Process vs. Multi-Threaded Execution of NCCL Tests on 8 H800 GPUs #1407

Closed: polarstormx closed this issue 3 months ago

polarstormx commented 3 months ago

Hello,

I ran nccl-tests on a node with 8 H800 GPUs and found that when the alltoall/sendrecv test is launched with 8 threads, performance is normal. However, when it is launched with 8 processes (either with torchrun or by starting nccl-tests with MPI), performance degrades.

I captured GPU metrics during the runs using Nsight Systems and found that NVLink bandwidth is lower in the multi-process case.

nccl-tests sendrecv_perf:
8 threads (using `-t 8`): [screenshot]
8 processes (using `mpirun -np 8`): [screenshot]

This phenomenon is more pronounced in the alltoall test.

Command for 8 threads: `./nccl-tests/build/alltoall_perf -b 4K -e 8G -g 1 -t 8 -f 2` [screenshots]

Command for 8 processes: `mpirun --hostfile hosts -np 8 ./nccl-tests-mpi/build/alltoall_perf -b 4K -e 8G -g 1 -t 1 -f 2` [screenshots]

In my understanding, the two launch methods differ only in how the CPU launches kernels to the GPUs, and the series of kernels subsequently executed on the GPUs should be the same. However, the actual performance differs. Could you please advise where the issue might be? It also appears that GPU DRAM usage is higher with the multi-process approach; could this be related to the problem?
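One way to narrow this down (a sketch, not part of the original report): run both launch modes with NCCL debug logging enabled and compare which peer-to-peer path NCCL selects for each GPU pair. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables, `-x` is Open MPI's flag for exporting them to the ranks, and the log file names are arbitrary.

```sh
# 8 threads in one process (intra-process peers)
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,P2P \
  ./nccl-tests/build/alltoall_perf -b 4K -e 8G -g 1 -t 8 -f 2 2>&1 | tee threads.log

# 8 MPI processes, one GPU each (inter-process peers)
mpirun --hostfile hosts -np 8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,P2P \
  ./nccl-tests-mpi/build/alltoall_perf -b 4K -e 8G -g 1 -t 1 -f 2 2>&1 | tee procs.log

# Then compare how NCCL reports the peer-to-peer path chosen for each GPU pair
# in the two logs.
```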

sjeaugey commented 3 months ago

There are basically three ways to launch tasks w.r.t. GPUs:

There are pros and cons of the different approaches:
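For concreteness, here is a sketch of how three launch modes look with nccl-tests on a single 8-GPU node (reusing the sizes from the commands above; the mapping onto the cases described here follows the standard `-t`/`-g` options of nccl-tests and is not part of the original comment):

```sh
# Total ranks = processes x threads per process x GPUs per thread.

# (a) One process per GPU: 8 ranks launched by MPI (torchrun plays the same role for PyTorch)
mpirun -np 8 ./nccl-tests-mpi/build/alltoall_perf -b 4K -e 8G -g 1 -t 1 -f 2

# (b) One process, one thread per GPU
./nccl-tests/build/alltoall_perf -b 4K -e 8G -g 1 -t 8 -f 2

# (c) One process, one thread driving all 8 GPUs
./nccl-tests/build/alltoall_perf -b 4K -e 8G -g 8 -t 1 -f 2
```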

polarstormx commented 3 months ago

@sjeaugey Thanks for your answer! Are these zero-copy operations enabled by default? In the code, do DirectRecv and DirectSend in genericOp refer to this?

sjeaugey commented 3 months ago

Yes, intra-process it's enabled in most cases, including send/recv (directSend/directRecv allow for zero-copy).