NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Why are tree algorithms specifically targeted at All-Reduce? #1473

Open · jxh314 opened 1 month ago

jxh314 commented 1 month ago

I'm running nccl-tests all-reduce between two nodes, and I've found that the tree algorithm performs much better than the ring algorithm. However, while reading the NCCL source code, I noticed that only the all-reduce operation seems to have a tree algorithm implementation. As far as I know, the FSDP (Fully Sharded Data Parallel) strategy achieves a global all-reduce through a combination of all-gather and reduce-scatter; I confirmed this by setting NCCL_DEBUG_SUBSYS=COLL. So why does only all-reduce appear to have a tree algorithm implementation? Would using the ring algorithm when training models with the FSDP strategy result in underutilization of bandwidth?
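
For reference, here is roughly the kind of check I ran (a minimal sketch using torch.distributed directly rather than FSDP itself; the tensor size and the torchrun launch are assumptions). NCCL reads NCCL_DEBUG and NCCL_DEBUG_SUBSYS when the communicator is created, so they must be set before init:

```python
import os

# NCCL reads these at communicator creation, so set them before init.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "COLL"

import torch
import torch.distributed as dist

dist.init_process_group("nccl")  # e.g. launched with torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

world = dist.get_world_size()
shard = torch.ones(1 << 20, device="cuda")

# FSDP-style traffic: all-gather parameters, reduce-scatter gradients.
gathered = [torch.empty_like(shard) for _ in range(world)]
dist.all_gather(gathered, shard)

grad_out = torch.empty_like(shard)
dist.reduce_scatter(grad_out, gathered)

dist.destroy_process_group()
```

Running this with `torchrun --nproc_per_node=<gpus> script.py` makes NCCL log each collective it receives, showing that AllGather and ReduceScatter (not AllReduce) are the calls being issued.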

sjeaugey commented 1 month ago

> why does it appear that only all-reduce has a tree algorithm implementation

Implementing a Tree algorithm for allgather/reducescatter is very hard, unlike for allreduce. The Tree algorithm is built from reduce/broadcast operations (an allreduce is semantically a reduce to a root followed by a broadcast from it), and that decomposition doesn't apply to allgather/reducescatter.
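
As an illustration of that decomposition (a sketch in torch.distributed, not NCCL's internal code):

```python
import torch
import torch.distributed as dist

def allreduce_as_reduce_broadcast(t: torch.Tensor, root: int = 0):
    """Semantically equivalent to dist.all_reduce(t) with SUM:
    reduce every rank's contribution onto a root, then broadcast
    the result back out. A tree naturally implements each of these
    two phases, which is why tree schedules map onto allreduce but
    not onto allgather/reducescatter."""
    dist.reduce(t, dst=root, op=dist.ReduceOp.SUM)
    dist.broadcast(t, src=root)
```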

That being said, in 2.23 we introduced the PAT algorithms, a variation of Bruck's algorithm that reorders the schedule to keep memory requirements bounded. They improve the performance of allgather/reducescatter at scale, but they are still a work in progress and not yet as low-latency as the Tree allreduce. They also only support one GPU per node at the moment, i.e., the intra-node part is not yet implemented.
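
For intuition on the Bruck schedule PAT builds on, here is a small pure-Python simulation (a sketch of the textbook Bruck allgather, not NCCL's PAT code): each step doubles the data every rank holds, so n ranks finish in ceil(log2 n) exchange steps, which is where the log-scale latency advantage over Ring comes from. PAT's variation reorders this schedule so that buffering needs stay bounded.

```python
import math

def bruck_allgather(n: int) -> int:
    """Simulate Bruck's allgather schedule on n ranks.

    Each rank starts with only its own block; returns the number of
    exchange steps until every rank holds all n blocks."""
    buf = [[i] for i in range(n)]  # buf[i]: blocks currently held by rank i
    steps, d = 0, 1
    while d < n:
        # At distance d, rank i sends its first min(d, n - d) blocks to
        # rank (i - d) mod n, i.e. it receives that many from (i + d) mod n.
        cnt = min(d, n - d)
        incoming = [buf[(i + d) % n][:cnt] for i in range(n)]
        for i in range(n):
            buf[i] += incoming[i]
        d *= 2
        steps += 1
    # Rank i now holds blocks i, i+1, ..., i+n-1 (mod n); a local rotation
    # would restore the canonical 0..n-1 order.
    assert all(sorted(b) == list(range(n)) for b in buf)
    return steps

print(bruck_allgather(6))   # 3 steps, vs. 5 send/recv steps for a Ring
assert bruck_allgather(16) == math.ceil(math.log2(16))
```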