NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

How do collective operations call runRing, runTreeUpDown, and runTreeSplit #1300

Open · ZhiyiHu1999 opened 4 months ago

ZhiyiHu1999 commented 4 months ago

Hello, I want to ask how the collective operations defined in collectives.cc call the runRing, runTreeUpDown, and runTreeSplit functions. For example, ncclAllReduce() is defined in collectives.cc; how does it reach the runRing, runTreeUpDown, and runTreeSplit functions defined in all_reduce.h to run these algorithms? In addition, how does ncclAllReduce() choose which algorithm to use? (I could barely find any file that includes all_reduce.h, which adds to my confusion.) Thanks a lot!

Hizhaoyuan commented 4 months ago

The ncclLaunchKernel function plays a pivotal role: it is responsible for initiating the execution of NCCL kernels. Its implementation relies on CUDA's cudaLaunchKernel API, which enqueues the NCCL kernel for execution.

To understand this process thoroughly, it helps to look at how CUDA kernels are executed. Within NCCL's implementation, cudaLaunchKernel is the function that actually triggers kernel execution. Its first argument is a pointer to a CUDA kernel function that conforms to a specific signature.
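Here is a minimal, self-contained sketch of how cudaLaunchKernel is used. The kernel name, signature, and launch dimensions below are hypothetical stand-ins for illustration, not NCCL's actual kernels:

```cuda
#include <cuda_runtime.h>

// Hypothetical __global__ entry point; it stands in for the NCCL kernels
// launched by ncclLaunchKernel. The name and signature are illustrative.
__global__ void myKernel(int *data, int value) {
  data[threadIdx.x] = value;
}

int main() {
  int *d_data;
  cudaMalloc(&d_data, 32 * sizeof(int));

  int value = 42;
  // cudaLaunchKernel takes a pointer to a __global__ function as its first
  // argument, followed by grid/block dimensions, an array of pointers to
  // the kernel's arguments, the shared-memory size, and the stream.
  void *args[] = { &d_data, &value };
  cudaLaunchKernel((const void *)myKernel, dim3(1), dim3(32),
                   args, /*sharedMem=*/0, /*stream=*/0);

  cudaDeviceSynchronize();
  cudaFree(d_data);
  return 0;
}
```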

In the NCCL source code, you may notice that header files such as all_reduce.h and all_gather.h define functions with the `__device__` attribute. These functions can only execute on the device side and are called by functions with the `__global__` attribute defined in common.cu. The `__global__` functions serve as the entry points for the CUDA kernels; they are the targets pointed to by the first argument of cudaLaunchKernel.
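To make the dispatch structure concrete, here is a simplified sketch of a `__global__` entry point calling `__device__` algorithm implementations. The function names mirror the ones from all_reduce.h, but the bodies, signatures, and the Algo enum are hypothetical simplifications; the real NCCL functions are templates parameterized on data type, reduction op, and protocol, and the algorithm (e.g. NCCL_ALGO_RING vs. NCCL_ALGO_TREE) is chosen on the host during tuning before launch:

```cuda
#include <cuda_runtime.h>

// Illustrative stand-ins for the __device__ functions declared in headers
// like all_reduce.h. Real NCCL versions are heavily templated.
__device__ void runRing(float *buf, int n)       { /* ring algorithm */ }
__device__ void runTreeUpDown(float *buf, int n) { /* tree algorithm */ }
__device__ void runTreeSplit(float *buf, int n)  { /* split-tree algorithm */ }

// Hypothetical algorithm IDs; NCCL selects the algorithm on the host
// based on message size, topology, and tuning tables.
enum Algo { ALGO_RING, ALGO_TREE_UP_DOWN, ALGO_TREE_SPLIT };

// The __global__ entry point (analogous to the kernels in common.cu) is
// what cudaLaunchKernel points at; it dispatches to the __device__
// implementation that was selected on the host side.
__global__ void allReduceKernel(float *buf, int n, int algo) {
  switch (algo) {
    case ALGO_RING:         runRing(buf, n);       break;
    case ALGO_TREE_UP_DOWN: runTreeUpDown(buf, n); break;
    case ALGO_TREE_SPLIT:   runTreeSplit(buf, n);  break;
  }
}
```

Because the `__device__` functions are only ever called from these `__global__` entry points, all_reduce.h is pulled in through the kernel translation units rather than included throughout the host code, which is why grepping for its includers turns up so little.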