Open jxh314 opened 1 month ago
why does it appear that only all-reduce has a tree algorithm implementation
Implementing a Tree algorithm for allgather/reducescatter is very hard, contrary to allreduce. The Tree algorithm implements multiple reduce/broadcast operations, so it doesn't apply to allgather/reducescatter.
That being said, in 2.23 we introduced the PAT algorithms which are a variation of brucks algorithm reordering the schedule to keep the memory needs limited. They improve the performance of allgather/reducescatter at scale, but they are still a work in progress and not yet as low-latency as the Tree allreduce. They also only support one GPU per node at the moment, i.e. the intra-node part is not yet implemented.
I'm running nccl-test
all-reduce
between two nodes, and I've found that the tree algorithm performs much better than the ring algorithm. However, through reading the NCCL source code, I noticed that it seems only the all-reduce operation has a tree algorithm implementation. As far as I know, with the FSDP (Fully Sharded Data Parallelism) strategy, it achieves global all-reduce through a combination ofall-gather
andreduce-scatter
. I confirmed this by settingNCCL_DEBUG_SUBSYS=COLL
. So, why does it appear that only all-reduce has a tree algorithm implementation? Would using the ring algorithm when training models with the FSDP strategy result in underutilization of bandwidth?