NVIDIA / nccl

Optimized primitives for collective multi-GPU communication
Other
3.28k stars 829 forks source link

Why does NCCL pass a pointer rather than `struct ncclDevKernelArgs` itself to `ncclKernelMain`? #1404

Closed YconquestY closed 3 months ago

YconquestY commented 3 months ago
// src/device/common.h

template<int SpecializedFnId, typename SpecializedRunWorkBatch>
__device__ __forceinline__ void ncclKernelMain(struct ncclDevKernelArgs const* args) {
  …
  // Copy kernel args to shmem and then only read those. Otherwise the compiler
  // will end up putting the args into thread local stack which is very wasteful.
  if (tid < sizeof(ncclDevKernelArgs)/sizeof(uint32_t)) {
    ((uint32_t*)&ncclShmem.args)[tid] = ((uint32_t*)args)[tid];
  }
  …
}

The comment says kernel parameters are put in thread local stack by the compiler. But according to CUDA 12.1 Supports Large Kernel Parameters, kernel parameters are passed from host to device via constant memory. So is it really necessary for NCCL to pass a pointer and load from this address instead of simply passing the struct?

jbachan commented 3 months ago

Passing the pointer is necessary because this struct is just the beginning of the much larger ncclDevkernalArgs4K struct which holds up to 4KB of work metadata. We need the base address for use in loadWorkBatchToShmem.

As for copying this small struct to smem first it probably is unnecessary. I think this was a defensive move when I was considering modifying values within the struct. If you modify a variable in constant memory, that's when the compiler silently moves it to thread local memory first. Now that you've provoked me to scrutinize it I believe just reading from the pointer ought to be a little better because the compiler can prove that constant memory doesn't change, whereas with smem it has to pessimistically reload it.

YconquestY commented 3 months ago

I see. Thank you.