NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Performance Degradation in Multi-Process vs. Multi-Threaded Execution of NCCL Tests on 8 H800 GPUs #1407

Closed: polarstormx closed this issue 3 months ago

polarstormx commented 3 months ago

Hello,

I ran nccl-tests on a node with 8 H800 GPUs and found that when the alltoall/sendrecv test is launched with 8 threads, performance is normal. However, when it is launched with 8 processes (either with torchrun or by starting nccl-tests with MPI), performance degrades.

I captured GPU metrics during the runs using Nsight Systems and found that NVLink bandwidth is lower in the multi-process case.

nccl-tests sendrecv_perf:
8 threads (using `-t 8`): [screenshot]
8 processes (using `mpirun -np 8`): [screenshot]

This phenomenon is more pronounced in the alltoall test.

Command for 8 threads: `./nccl-tests/build/alltoall_perf -b 4K -e 8G -g 1 -t 8 -f 2` [screenshots]

Command for 8 processes: `mpirun --hostfile hosts -np 8 ./nccl-tests-mpi/build/alltoall_perf -b 4K -e 8G -g 1 -t 1 -f 2` [screenshots]

In my understanding, the two launch methods differ only in how the CPU launches kernels to the GPUs, and the series of kernels subsequently executed on the GPUs should be the same. However, the actual performance differs. Could you please advise where the issue might be? It also appears that GPU DRAM usage is higher with the multi-process approach; could this be related to the problem?
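One way to narrow this down (a sketch, not part of the original report): run both launch modes with NCCL debug logging enabled and compare which peer-to-peer path NCCL selects for each GPU pair. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables, `-x` is Open MPI's flag for exporting them to the ranks, and the log file names are arbitrary.

```sh
# 8 threads in one process (intra-process peers)
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,P2P \
  ./nccl-tests/build/alltoall_perf -b 4K -e 8G -g 1 -t 8 -f 2 2>&1 | tee threads.log

# 8 MPI processes, one GPU each (inter-process peers)
mpirun --hostfile hosts -np 8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,P2P \
  ./nccl-tests-mpi/build/alltoall_perf -b 4K -e 8G -g 1 -t 1 -f 2 2>&1 | tee procs.log

# Then compare how NCCL reports the peer-to-peer path chosen for each GPU pair
# in the two logs.
```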

sjeaugey commented 3 months ago

There are basically three ways to launch tasks w.r.t. GPUs:

There are pros and cons of the different approaches:
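For concreteness, here is a sketch of how three launch modes look with nccl-tests on a single 8-GPU node (reusing the sizes from the commands above; the mapping onto the cases described here follows the standard `-t`/`-g` options of nccl-tests and is not part of the original comment):

```sh
# Total ranks = processes x threads per process x GPUs per thread.

# (a) One process per GPU: 8 ranks launched by MPI (torchrun plays the same role for PyTorch)
mpirun -np 8 ./nccl-tests-mpi/build/alltoall_perf -b 4K -e 8G -g 1 -t 1 -f 2

# (b) One process, one thread per GPU
./nccl-tests/build/alltoall_perf -b 4K -e 8G -g 1 -t 8 -f 2

# (c) One process, one thread driving all 8 GPUs
./nccl-tests/build/alltoall_perf -b 4K -e 8G -g 8 -t 1 -f 2
```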

polarstormx commented 3 months ago

@sjeaugey Thanks for your answer! Are these zero-copy operations enabled by default? In the code, do DirectRecv and DirectSend in genericOp refer to this?

sjeaugey commented 3 months ago

Yes, intra-process it's enabled in most cases, including send/recv (directSend/directRecv allow for zero-copy).