KirilDan opened 3 weeks ago
NCCL has a built-in cost model that it uses at run time to select a particular algorithm (ring, tree, etc.) and protocol. This depends on the hardware configuration of course.
Not all algorithms support every collective operation so, even irrespective of the cost model, we need to mix and match.
Even for a particular collective, the choice of algorithm depends on the operation size (in bytes), as the cost model uses the typical latency/bandwidth calculation.
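A toy sketch of that kind of latency/bandwidth tradeoff (the numbers and the two-algorithm setup here are made up for illustration, not NCCL's actual tuning tables):

```python
def cost_us(latency_us, bandwidth_gbps, nbytes):
    # Classic cost model: fixed latency term plus transfer time.
    # 1 GB/s == 1e3 bytes per microsecond.
    return latency_us + nbytes / (bandwidth_gbps * 1e3)

# Hypothetical parameters: tree-like algorithms tend to have lower
# latency, ring-like algorithms higher sustained bandwidth.
ALGOS = {
    "tree": (10.0, 20.0),  # (latency in us, bandwidth in GB/s)
    "ring": (25.0, 40.0),
}

def pick_algo(nbytes):
    # Choose whichever algorithm minimizes the modeled cost for this size.
    return min(ALGOS, key=lambda a: cost_us(*ALGOS[a], nbytes))

print(pick_algo(1024))        # small message: latency dominates -> tree
print(pick_algo(10_000_000))  # large message: bandwidth dominates -> ring
```

This is why the same collective can end up on different algorithms within one run: the crossover point between the low-latency and high-bandwidth option moves with the message size.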
You can run NCCL with `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING` to get the latency/bandwidth combinations at init time for every collective/algorithm/protocol, as well as specific info on the algorithm/protocol chosen for each collective operation call (the latter uses internal numeric IDs that you can look up in `src/include/nccl_common.h`).
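For instance, when launching a job (the launcher and the script name `train.py` are hypothetical; the environment variables are the ones above):

```shell
# Enable NCCL init/env/tuning logs for a 4-GPU run.
# train.py is a placeholder for your own training script.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING \
  python -m torch.distributed.run --nproc_per_node=4 train.py
```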
I’m running machine learning training jobs on an Amazon EC2 instance with 4 GPUs, and I need some clarification on how NCCL selects the communication topology (e.g., Ring, Tree) for the various collective operations (broadcast, all-reduce, all-gather) that are essential for distributed training.
Given that these training jobs require multiple types of collective operations, I have a few specific questions:
Topology Selection for Each Operation: How does NCCL decide which topology to use for each collective operation during the initialization phase? Specifically, does it select different topologies for different operations (like using Ring for all-reduce and Tree for broadcast), or does it apply the same topology to all operations within a single training run?
Consistency Across Operations: Is it common for NCCL to mix topologies within the same training session—for example, using a Ring topology for one operation and a Tree topology for another? Or does NCCL generally stick to one topology for all operations in a training job?

Thank you in advance.