NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

How Does NCCL Select Topologies for Different Collective Operations During Machine Learning Training #1415

Open · KirilDan opened this issue 3 weeks ago

KirilDan commented 3 weeks ago

I’m running machine learning training jobs on an Amazon EC2 instance with 4 GPUs, and I need some clarification on how NCCL selects the communication topology (e.g., Ring, Tree) for the various collective operations (broadcast, all-reduce, all-gather) that are essential for distributed training.

Given that these training jobs require multiple types of collective operations, I have a few specific questions:

Topology Selection for Each Operation: How does NCCL decide which topology to use for each collective operation during the initialization phase? Specifically, does it select different topologies for different operations (like using Ring for all-reduce and Tree for broadcast), or does it apply the same topology to all operations within a single training run?

Consistency Across Operations: Is it common for NCCL to mix topologies within the same training session, for example using a Ring topology for one operation and a Tree topology for another? Or does NCCL generally stick to one topology for all operations in a training job? Thank you in advance!

kiskra-nvidia commented 2 weeks ago

NCCL has a built-in cost model that it uses at run time to select a particular algorithm (ring, tree, etc.) and protocol. This depends on the hardware configuration, of course.

Not all algorithms support every collective operation, so even irrespective of the cost model we need to mix and match.

Even for a particular collective, the choice of algorithm depends on the operation size (in bytes), since the cost model uses the typical latency/bandwidth calculation (sketched below).
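
To make the size dependence concrete, here is a toy sketch of a latency/bandwidth cost comparison. The two algorithm entries, their constants, and the crossover point are invented for illustration only; NCCL's actual cost model covers more algorithms and protocols and derives its parameters from the detected hardware.

```c
#include <stdio.h>

/* Hypothetical per-algorithm parameters, NOT NCCL's real numbers. */
typedef struct {
    const char *name;
    double base_latency_us;  /* fixed per-call cost */
    double bandwidth_gbps;   /* sustained bus bandwidth */
} algo_model;

/* Classic cost estimate: time = latency + bytes / bandwidth. */
static double predicted_time_us(const algo_model *a, double bytes) {
    /* 1 GB/s is 1e3 bytes/us, so dividing by (bw * 1e3) yields microseconds. */
    return a->base_latency_us + bytes / (a->bandwidth_gbps * 1e3);
}

int main(void) {
    /* Made-up numbers: the tree model has lower latency, the ring model higher bandwidth. */
    algo_model tree = {"Tree", 15.0, 40.0};
    algo_model ring = {"Ring", 30.0, 80.0};

    double sizes[] = {4e3, 1e6, 256e6};  /* 4 KB, 1 MB, 256 MB */
    for (int i = 0; i < 3; ++i) {
        double t_tree = predicted_time_us(&tree, sizes[i]);
        double t_ring = predicted_time_us(&ring, sizes[i]);
        printf("%10.0f bytes -> %s (tree %.1f us, ring %.1f us)\n",
               sizes[i], t_tree < t_ring ? tree.name : ring.name, t_tree, t_ring);
    }
    return 0;
}
```

With these invented numbers the smaller messages favor the low-latency entry and the 256 MB message favors the high-bandwidth one, which is the kind of size-dependent switch the cost model produces.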

You can run NCCL with `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING` to get the latency/bw combos at init time for every collective/algo/proto, as well as specific info on the algo/proto chosen for each collective operation call (the latter uses internal numeric ids that you can look up in `src/include/nccl_common.h`).
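
For anyone who wants to try this on a 4-GPU instance, below is a minimal single-process all-reduce program in the style of the standard NCCL examples; the device list, element count, and build command are assumptions to adapt to your own setup. Running it under the environment variables above prints the per-collective tuning tables at init and the algo/proto chosen for each call.

```c
/*
 * Minimal single-process, 4-GPU all-reduce (illustrative sketch).
 * Build, e.g.: nvcc -o allreduce_test allreduce_test.c -lnccl
 * Run:         NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING ./allreduce_test
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CUDACHECK(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
    printf("CUDA error %s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e)); exit(1); } } while (0)
#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    printf("NCCL error %s:%d: %s\n", __FILE__, __LINE__, ncclGetErrorString(r)); exit(1); } } while (0)

int main(void) {
    const int nDev = 4;                      /* assumption: 4 GPUs, as in the question */
    const size_t count = 32 * 1024 * 1024;   /* 32M floats; vary this to see the algo/proto choice change */
    int devs[4] = {0, 1, 2, 3};
    ncclComm_t comms[4];
    float *sendbuff[4], *recvbuff[4];
    cudaStream_t streams[4];

    /* Allocate buffers and a stream on each device. */
    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(devs[i]));
        CUDACHECK(cudaMalloc((void**)&sendbuff[i], count * sizeof(float)));
        CUDACHECK(cudaMalloc((void**)&recvbuff[i], count * sizeof(float)));
        CUDACHECK(cudaMemset(sendbuff[i], 1, count * sizeof(float)));  /* contents don't matter for this smoke test */
        CUDACHECK(cudaStreamCreate(&streams[i]));
    }

    /* One communicator per GPU, all within this single process. */
    NCCLCHECK(ncclCommInitAll(comms, nDev, devs));

    /* Group the per-GPU calls so NCCL launches them as one collective. */
    NCCLCHECK(ncclGroupStart());
    for (int i = 0; i < nDev; ++i)
        NCCLCHECK(ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat,
                                ncclSum, comms[i], streams[i]));
    NCCLCHECK(ncclGroupEnd());

    /* Wait for completion, then clean up. */
    for (int i = 0; i < nDev; ++i) {
        CUDACHECK(cudaSetDevice(devs[i]));
        CUDACHECK(cudaStreamSynchronize(streams[i]));
        CUDACHECK(cudaFree(sendbuff[i]));
        CUDACHECK(cudaFree(recvbuff[i]));
        CUDACHECK(cudaStreamDestroy(streams[i]));
    }
    for (int i = 0; i < nDev; ++i)
        ncclCommDestroy(comms[i]);

    printf("AllReduce done\n");
    return 0;
}
```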