There are basically three ways to launch tasks w.r.t. GPUs: a single process (and thread) driving all GPUs, one thread per GPU within a single process, or one process per GPU.
There are pros and cons to each of these approaches.
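As a concrete sketch, here is how the same 8-GPU benchmark maps onto the three modes with nccl-tests (flags taken from the commands quoted further down; sizes are illustrative):

# 1) One process, one thread, all 8 GPUs driven by that thread:
./nccl-tests/build/alltoall_perf -b 4K -e 8G -f 2 -t 1 -g 8
# 2) One process, 8 threads, one GPU per thread:
./nccl-tests/build/alltoall_perf -b 4K -e 8G -f 2 -t 8 -g 1
# 3) 8 processes (MPI), one GPU per process:
mpirun -np 8 ./nccl-tests-mpi/build/alltoall_perf -b 4K -e 8G -f 2 -t 1 -g 1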
@sjeaugey Thanks for your answer! Are these zero-copy operations enabled by default? In the code, do DirectRecv and DirectSend in genericOp refer to this?
Yes, intra-process it's enabled in most cases, including send/recv (directSend/directRecv allow for zero-copy).
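To check which path is taken in a given setup, NCCL's debug log reports the transport selected for each peer connection; a sketch (the exact log wording varies across NCCL versions):

# Intra-process (8 threads): peer connections should be reported as "via P2P/direct pointer".
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,P2P ./nccl-tests/build/sendrecv_perf -b 4K -e 8G -f 2 -t 8 -g 1
# One process per GPU: connections typically show up as "via P2P/IPC" instead,
# i.e. CUDA IPC going through NCCL's intermediate buffers rather than directly into the destination buffer.
mpirun -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,P2P -np 8 ./nccl-tests-mpi/build/sendrecv_perf -b 4K -e 8G -f 2 -t 1 -g 1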
Hello,
I ran nccl-tests on a node with 8 H800 GPUs and found that when using 8 threads to start the alltoall/sendrecv test, performance was normal. However, when using 8 processes (either with torchrun or launching nccl-tests with MPI), performance degraded.
I captured GPU metrics at runtime with Nsight Systems and found that the NVLink bandwidth is lower in the multi-process case.
[nccl-tests sendrecv_perf results: 8 threads (using -t 8) vs. 8 processes (using mpirun -np 8)]
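For anyone reproducing this, NVLink throughput can be sampled with Nsight Systems roughly as follows (report names and options here are only an example, not necessarily what I ran):

# Sample GPU metrics (including NVLink throughput) around the 8-thread run:
nsys profile --gpu-metrics-device=all -o sendrecv_8threads ./nccl-tests/build/sendrecv_perf -b 4K -e 8G -f 2 -t 8 -g 1
# Same sampling around the whole 8-process MPI run:
nsys profile --gpu-metrics-device=all -o sendrecv_8procs mpirun -np 8 ./nccl-tests-mpi/build/sendrecv_perf -b 4K -e 8G -f 2 -t 1 -g 1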
This phenomenon is more pronounced in the alltoall test. Command for 8 threads:
./nccl-tests/build/alltoall_perf -b 4K -e 8G -g 1 -t 8 -f 2
Command for 8 processes:
mpirun --hostfile hosts -np 8 ./nccl-tests-mpi/build/alltoall_perf -b 4K -e 8G -g 1 -t 1 -f 2
In my understanding, the two launch methods differ only in how the CPU launches kernels to the GPU; the subsequent execution of the kernels on the GPU should be the same. However, the actual performance is not the same. Could you please advise where the issue might be? It also appears that GPU DRAM usage is higher with the multi-process approach. Could this be related to the problem?
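If it helps, I can collect per-GPU memory usage for both runs with something like the following (the sampling interval is arbitrary):

# Sample per-GPU memory usage once per second while the benchmark runs:
nvidia-smi --query-gpu=index,memory.used --format=csv -l 1 > memory_usage.log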