Closed by YconquestY 3 months ago
Passing the pointer is necessary because this struct is just the beginning of the much larger `ncclDevKernelArgs4K` struct, which holds up to 4 KB of work metadata. We need the base address for use in `loadWorkBatchToShmem`.
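To make the layout argument concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). It is not NCCL's code: `KernelArgsHeader`, `WorkBatch`, `KernelArgs4K`, `makeArgs`, and `loadWorkBatch` are hypothetical stand-ins, and the field names are assumptions made for illustration. It only shows why a pointer to the small header is enough to reach the metadata that follows it, while a by-value copy of the header alone would not be.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical small header; the fields are illustrative, not NCCL's.
struct KernelArgsHeader {
  std::uint32_t workBatchCount;  // number of batch records after the header
  std::uint32_t reserved;
};

// Hypothetical per-batch metadata record stored right after the header.
struct WorkBatch {
  std::uint32_t funcId;
  std::uint32_t nWorks;
};

// Fixed-capacity storage whose first bytes are the header.
template <std::size_t Capacity>
struct KernelArgsStorage {
  union {
    KernelArgsHeader args;
    unsigned char storage[Capacity];
  };
};
using KernelArgs4K = KernelArgsStorage<4096>;

static_assert(sizeof(KernelArgs4K) == 4096, "storage should be exactly 4 KB");

// Host-side packing: header first, then the batch records back to back.
inline KernelArgs4K makeArgs(std::uint32_t nBatches) {
  KernelArgs4K a;
  std::memset(a.storage, 0, sizeof(a.storage));
  KernelArgsHeader h{nBatches, 0};
  std::memcpy(a.storage, &h, sizeof(h));
  for (std::uint32_t i = 0; i < nBatches; ++i) {
    WorkBatch b{100 + i, i + 1};
    std::memcpy(a.storage + sizeof(h) + i * sizeof(b), &b, sizeof(b));
  }
  return a;
}

// Consumer side: given only a pointer to the header, treat its address as the
// base of the whole block and read batch i from the bytes that follow.
inline WorkBatch loadWorkBatch(const KernelArgsHeader* header, std::size_t i) {
  const unsigned char* base = reinterpret_cast<const unsigned char*>(header);
  WorkBatch batch;
  std::memcpy(&batch, base + sizeof(KernelArgsHeader) + i * sizeof(WorkBatch),
              sizeof(WorkBatch));
  return batch;
}
```

Calling `loadWorkBatch(&a.args, 2)` on the result of `makeArgs(3)` reads the third record out of the trailing bytes. Had the consumer received a `KernelArgsHeader` by value instead, the address of that copy would no longer point into the 4 KB block, so the offset arithmetic would read unrelated memory.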
As for copying this small struct to smem first, it is probably unnecessary. I think it was a defensive move from when I was considering modifying values within the struct: if you modify a variable in constant memory, the compiler silently copies it to thread-local memory first. Now that you've provoked me to scrutinize it, I believe just reading through the pointer ought to be a little better, because the compiler can prove that constant memory doesn't change, whereas with smem it has to pessimistically reload the values.
I see. Thank you.
The comment says kernel parameters are put on the thread-local stack by the compiler. But according to "CUDA 12.1 Supports Large Kernel Parameters", kernel parameters are passed from host to device via constant memory. So is it really necessary for NCCL to pass a pointer and load from this address instead of simply passing the `struct`?