NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

How do collective operations call runRing, runTreeUpDown, and runTreeSplit #1300

Open · ZhiyiHu1999 opened 4 months ago

ZhiyiHu1999 commented 4 months ago

Hello, I want to ask how the collective operations defined in collectives.cc call the runRing, runTreeUpDown, and runTreeSplit functions. For example, ncclAllReduce() is defined in collectives.cc; how does it reach the runRing, runTreeUpDown, and runTreeSplit functions defined in all_reduce.h to run these algorithms? In addition, how does ncclAllReduce() choose which algorithm to use? (I could barely find any file that includes all_reduce.h, which adds to my confusion.) Thanks a lot!

Hizhaoyuan commented 4 months ago

The ncclLaunchKernel function plays a pivotal role: it is responsible for initiating the execution of NCCL kernels. Its implementation relies on CUDA's cudaLaunchKernel API, which enqueues the NCCL kernel for execution.

To understand this process thoroughly, it helps to look at how CUDA kernels are executed. Within NCCL's implementation, cudaLaunchKernel is the function that actually triggers kernel execution. Its first argument is a pointer to a CUDA kernel function that conforms to a specific signature.
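Here is a minimal, self-contained sketch of how cudaLaunchKernel is used. The kernel name, signature, and launch dimensions below are hypothetical stand-ins for illustration, not NCCL's actual kernels:

```cuda
#include <cuda_runtime.h>

// Hypothetical __global__ entry point; it stands in for the NCCL kernels
// launched by ncclLaunchKernel. The name and signature are illustrative.
__global__ void myKernel(int *data, int value) {
  data[threadIdx.x] = value;
}

int main() {
  int *d_data;
  cudaMalloc(&d_data, 32 * sizeof(int));

  int value = 42;
  // cudaLaunchKernel takes a pointer to a __global__ function as its first
  // argument, followed by grid/block dimensions, an array of pointers to
  // the kernel's arguments, the shared-memory size, and the stream.
  void *args[] = { &d_data, &value };
  cudaLaunchKernel((const void *)myKernel, dim3(1), dim3(32),
                   args, /*sharedMem=*/0, /*stream=*/0);

  cudaDeviceSynchronize();
  cudaFree(d_data);
  return 0;
}
```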

In the NCCL source code, you may notice that header files such as all_reduce.h and all_gather.h define functions with the `__device__` attribute. These functions can only execute on the device side and are called by functions with the `__global__` attribute defined in common.cu. The `__global__` functions serve as the entry points for the CUDA kernels; they are the targets pointed to by the first argument of cudaLaunchKernel.
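To make the dispatch structure concrete, here is a simplified sketch of a `__global__` entry point calling `__device__` algorithm implementations. The function names mirror the ones from all_reduce.h, but the bodies, signatures, and the Algo enum are hypothetical simplifications; the real NCCL functions are templates parameterized on data type, reduction op, and protocol, and the algorithm (e.g. NCCL_ALGO_RING vs. NCCL_ALGO_TREE) is chosen on the host during tuning before launch:

```cuda
#include <cuda_runtime.h>

// Illustrative stand-ins for the __device__ functions declared in headers
// like all_reduce.h. Real NCCL versions are heavily templated.
__device__ void runRing(float *buf, int n)       { /* ring algorithm */ }
__device__ void runTreeUpDown(float *buf, int n) { /* tree algorithm */ }
__device__ void runTreeSplit(float *buf, int n)  { /* split-tree algorithm */ }

// Hypothetical algorithm IDs; NCCL selects the algorithm on the host
// based on message size, topology, and tuning tables.
enum Algo { ALGO_RING, ALGO_TREE_UP_DOWN, ALGO_TREE_SPLIT };

// The __global__ entry point (analogous to the kernels in common.cu) is
// what cudaLaunchKernel points at; it dispatches to the __device__
// implementation that was selected on the host side.
__global__ void allReduceKernel(float *buf, int n, int algo) {
  switch (algo) {
    case ALGO_RING:         runRing(buf, n);       break;
    case ALGO_TREE_UP_DOWN: runTreeUpDown(buf, n); break;
    case ALGO_TREE_SPLIT:   runTreeSplit(buf, n);  break;
  }
}
```

Because the `__device__` functions are only ever called from these `__global__` entry points, all_reduce.h is pulled in through the kernel translation units rather than included throughout the host code, which is why grepping for its includers turns up so little.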