Why are not all SMs active when NCCL kernel and compute kernel overlap?

NVIDIA / nccl

Optimized primitives for collective multi-GPU communication


Why are not all SMs active when NCCL kernel and compute kernel overlap? #1432

Open yu-depend opened 2 months ago

yu-depend commented 2 months ago

When I run a single NCCL kernel on its own, the active SMs metric is 15%, and when I run a single compute kernel on its own, it is 100%. But when I run the compute kernel and the NCCL kernel in parallel so that they overlap, the active SMs metric is 85%. How can this be explained?
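For anyone trying to reproduce the setup, here is a minimal sketch of the three cases, not the actual code behind the numbers above. It assumes a single GPU and uses a hypothetical stand-in kernel (`fewBlocksSpin`) in place of a real NCCL kernel, sized to roughly 15% of the SMs to mirror the figure reported; `busyCompute`, the iteration counts, and the stream names are likewise illustrative assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Hypothetical stand-in for the compute kernel: launched with enough
// blocks to give every SM work.
__global__ void busyCompute(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
    data[i] = v;
}

// Hypothetical stand-in for a communication kernel: a small number of
// long-running blocks. This is NOT a real NCCL kernel.
__global__ void fewBlocksSpin(float* data, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 0.9999f + 0.25f;
    data[i] = v;
}

int main() {
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    const int numSMs = prop.multiProcessorCount;
    const int threads = 256;

    // About 15% of the SM count. Block-to-SM placement is decided by the
    // hardware scheduler, so one block per SM is an assumption, not a guarantee.
    int commBlocks = numSMs * 15 / 100;
    if (commBlocks < 1) commBlocks = 1;
    const int computeBlocks = numSMs * 8;
    const int n = computeBlocks * threads;

    float *computeBuf = nullptr, *commBuf = nullptr;
    CHECK(cudaMalloc(&computeBuf, n * sizeof(float)));
    CHECK(cudaMalloc(&commBuf, commBlocks * threads * sizeof(float)));
    CHECK(cudaMemset(computeBuf, 0, n * sizeof(float)));
    CHECK(cudaMemset(commBuf, 0, commBlocks * threads * sizeof(float)));

    cudaStream_t commStream, computeStream;
    CHECK(cudaStreamCreateWithFlags(&commStream, cudaStreamNonBlocking));
    CHECK(cudaStreamCreateWithFlags(&computeStream, cudaStreamNonBlocking));

    // Case 1: communication-like kernel alone.
    fewBlocksSpin<<<commBlocks, threads, 0, commStream>>>(commBuf, 1 << 20);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Case 2: compute kernel alone.
    busyCompute<<<computeBlocks, threads, 0, computeStream>>>(computeBuf, n, 1 << 16);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Case 3: both kernels on separate streams so they can overlap.
    fewBlocksSpin<<<commBlocks, threads, 0, commStream>>>(commBuf, 1 << 20);
    busyCompute<<<computeBlocks, threads, 0, computeStream>>>(computeBuf, n, 1 << 16);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaStreamDestroy(commStream));
    CHECK(cudaStreamDestroy(computeStream));
    CHECK(cudaFree(computeBuf));
    CHECK(cudaFree(commBuf));
    printf("done: %d SMs, %d comm-like blocks, %d compute blocks\n",
           numSMs, commBlocks, computeBlocks);
    return 0;
}
```

Built with `nvcc` and run under the same profiler used for the numbers above, the three cases can be compared side by side. Whether the two kernels actually overlap in case 3, and for how long, depends on their relative durations and on available resources, so the iteration counts may need tuning on a given GPU.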