NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Why does NCCL not utilize all channels when the data volume is not large? #1216

Open ltrcc opened 8 months ago

ltrcc commented 8 months ago

The code is as follows:

  size_t bytePerChannel[/*collNetSupport*/2];
  if (comm->channelSize > 0) {
    // Set by user
    bytePerChannel[/*collNetSupport=*/0] = comm->channelSize;
    bytePerChannel[/*collNetSupport=*/1] = comm->channelSize;
  } else {
    // Latency increases as scale increases
    // We would thus want to increase the chunk size to compensate for the lost efficiency
    bytePerChannel[/*collNetSupport=*/0] = NCCL_AGG_CHANNEL_SIZE * std::min(16, comm->nRanks);
    bytePerChannel[/*collNetSupport=*/1] = 256<<10; // Hand-tuned
  }

  for (int collNetSupport=0; collNetSupport < 2; collNetSupport++) {
    while (tasks->collBytesTotal < bytePerChannel[collNetSupport]*comm->nChannels &&
           bytePerChannel[collNetSupport] > NCCL_MIN_CHANNEL_SIZE) {
      // Reduce per-channel size so we utilize all channels.
      bytePerChannel[collNetSupport] /= 2;
    }
  }

My question: why doesn't NCCL always use all channels? Sometimes the number of blocks launched ends up smaller than the total number of channels.
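
For illustration, here is a minimal standalone sketch (not NCCL source; the constants, channel count, and message size below are placeholder values chosen for illustration, standing in for NCCL_AGG_CHANNEL_SIZE, NCCL_MIN_CHANNEL_SIZE, etc.) that applies the same halving rule to a 256 KiB collective on 16 channels:

// Standalone sketch, not NCCL source: mimic the halving loop above.
// kAggChannelSize / kMinChannelSize are illustrative stand-ins, not the real values.
#include <algorithm>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t kAggChannelSize = 2 << 20;   // stand-in: start at 2 MiB per channel
  const size_t kMinChannelSize = 64 << 10;  // stand-in: 64 KiB floor
  const int    nChannels       = 16;        // assumed channel count

  size_t collBytesTotal = 256 << 10;        // a 256 KiB collective, for example
  size_t bytePerChannel = kAggChannelSize;

  // Same rule as above: shrink the per-channel chunk until the work could
  // spread across all channels, or until the floor is reached.
  while (collBytesTotal < bytePerChannel * nChannels && bytePerChannel > kMinChannelSize) {
    bytePerChannel /= 2;
  }

  // Once the floor is hit, a small message still fills only a few channels;
  // the remaining channels simply get no work.
  size_t used = std::min((collBytesTotal + bytePerChannel - 1) / bytePerChannel,
                         (size_t)nChannels);
  printf("bytePerChannel=%zu, channels with work: %zu of %d\n",
         bytePerChannel, used, nChannels);
  return 0;
}

With these placeholder numbers the 256 KiB message fills only 4 of the 16 channels even after bytePerChannel has been halved down to the floor, which is exactly the situation the question is about.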

crazy-JiangDongHua commented 8 months ago

If the amount of data is too small, NCCL reduces the number of channels it uses. See the function getChannnelThreadInfo:

// Ring/Tree channel tuning
while (collInfo->nBytes < nc * nt * threadThreshold) {
    if (nc >= 2) nc--;
    else break;
}
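
As a rough illustration (again not NCCL source; the thread count and threadThreshold below are placeholder values), applying the same reduction rule to a few message sizes shows how small collectives end up on fewer channels:

// Standalone sketch, not NCCL source: the channel-reduction rule quoted above,
// with placeholder values for the per-channel thread count and threshold.
#include <cstddef>
#include <cstdio>

// How many channels the tuning loop leaves in use for a given byte count.
static int channelsUsed(size_t nBytes, int nChannels, int nThreads, size_t threadThreshold) {
  int nc = nChannels;
  // Drop channels while each channel/thread would get less than threadThreshold bytes.
  while (nBytes < (size_t)nc * nThreads * threadThreshold) {
    if (nc >= 2) nc--;
    else break;
  }
  return nc;
}

int main() {
  const int    nChannels = 16;   // assumed channel count
  const int    nThreads  = 512;  // assumed threads per channel (placeholder)
  const size_t threshold = 64;   // placeholder threadThreshold, in bytes per thread

  const size_t sizes[] = {4 << 10, 256 << 10, 4 << 20, 64 << 20};
  for (size_t s : sizes) {
    printf("%10zu bytes -> %2d of %d channels\n",
           s, channelsUsed(s, nChannels, nThreads, threshold), nChannels);
  }
  return 0;
}

With these placeholder numbers, a 4 KiB message ends up on a single channel, 256 KiB on 8, and anything of 512 KiB or more keeps all 16.
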
ltrcc commented 8 months ago

Thank you for your answer, but what I want to ask is the reason for doing this. I recently learned that this seems to be done to maximize the overlap between communication and computation. I'm not sure whether that is correct, so I opened this issue to confirm it.

AddyLaddy commented 8 months ago

NCCL strives to use the minimum number of kernels in order to reach the best BW. Doing this frees up more SMs for the compute work which is often overlapped with communication. But looking at it another way, why would we want to use more SMs if it doesn't actually improve performance at that message size?

ltrcc commented 8 months ago

NCCL adjusts the number of channels based on the number of bytes being transferred, so I take it the NCCL designers have already considered how many SMs should be used to achieve the best performance at a given message size.