When I run a single NCCL kernel, the active SMs metric is 15%. When I run a single compute kernel, it is 100%. But when I run the compute kernel and the NCCL kernel in parallel so that they overlap, the active SMs metric is 85%. How can this be explained?