ltrcc opened this issue 8 months ago (status: Open)
If the amount of data is too small, NCCL reduces the number of channels it uses. In the function getChannnelThreadInfo:
    // Ring/Tree channel tuning
    while (collInfo->nBytes < nc * nt * threadThreshold) {
      if (nc >= 2) nc--;
      else break;
    }
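To make the effect of that loop easier to see, here is a minimal standalone sketch of the same reduction. It is not NCCL's code: `pickChannels` and its parameters are hypothetical names, and the values used in the note below (16 channels, 512 threads, a 64-byte per-thread threshold) are assumptions chosen only for illustration.

```cpp
#include <cstddef>

// Hypothetical stand-in for the quoted tuning loop (not an NCCL API).
// Starting from maxChannels, drop one channel at a time until the message
// gives every (channel, thread) pair at least threadThreshold bytes,
// but never go below a single channel.
static int pickChannels(std::size_t nBytes, int maxChannels, int nThreads,
                        std::size_t threadThreshold) {
  int nc = maxChannels;
  while (nBytes < static_cast<std::size_t>(nc) * nThreads * threadThreshold) {
    if (nc >= 2) nc--;  // too little data per channel: use one fewer
    else break;         // already at one channel, stop
  }
  return nc;
}
```

Under those assumed values each channel needs 512 * 64 = 32768 bytes to stay in use, so `pickChannels(1024, 16, 512, 64)` returns 1, `pickChannels(65536, 16, 512, 64)` returns 2, and `pickChannels(1048576, 16, 512, 64)` keeps all 16 channels.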
Thank you for your answer, but what I want to ask is the reason for doing this. I recently learned that this seems to be done to maximize the overlap between communication and computation. I'm not sure if this statement is correct, so I raised an issue to confirm it.
NCCL strives to use the minimum number of kernels needed to reach the best BW. Doing this frees up more SMs for the compute work, which is often overlapped with communication. But looking at it another way, why would we want to use more SMs if it doesn't actually improve performance at that message size?
NCCL adjusts the number of channels used based on the number of bytes of transmitted data, so I think NCCL designers have considered how many SMs are used to achieve the best effect under certain message sizes.
The code is as follows:
My question: Why does NCCL not always use all channels? Sometimes the number of blocks may be smaller than the total number of channels.