NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

non-ring implementation of all-gather? #1123

Open zplizzi opened 11 months ago

zplizzi commented 11 months ago

I just wanted to check whether there's any alternative to the ring implementation of all-gather, possibly one that takes advantage of SHARP/multicast switches. It seems like ring wouldn't be optimal for large clusters, and all-gather is used quite heavily in ZeRO-style LLM training, such as in FSDP.

Empirically, we're struggling to scale a large FSDP LLM training run to ~1000 ranks and are being limited by all-gather performance, although there are a number of other issues in our setup that are likely larger factors. I just wanted to make sure I fully understand how all-gathers work in NCCL to aid in debugging. It seems like most attention here has been on all-reduce, since in the past that's been more relevant to training performance.

jbachan commented 11 months ago

Indeed, we are planning a SHARP-based allgather/reduce_scatter for the upcoming 2.20 release. This should substantially help performance at your scale.

zplizzi commented 11 months ago

Awesome! Is there an ETA on 2.20 yet?

mvpatel2000 commented 11 months ago

@jbachan is there a reason NCCL does not support other algorithms for all-gather? My (very naive) guess would be that some kind of ~tree-based all-gather~ (maybe pairwise aggregation?) would be better for large-scale GPU training, where ring algorithms are severely affected by latency.
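To make the latency concern concrete, here is a minimal Python simulation, not NCCL code, that counts communication steps for a ring all-gather and for recursive doubling, one well-known pairwise-exchange alternative. The function names are hypothetical, and recursive doubling is chosen here only as an illustration; the thread does not say which algorithm the commenter had in mind. Step count is a rough proxy for latency only: both schemes move the same total data per rank, and the model ignores bandwidth and congestion.

```python
def ring_all_gather(n):
    """Simulate a ring all-gather over n ranks; return (chunks_per_rank, steps)."""
    # Each rank starts with only its own chunk, identified by its rank index.
    have = [{r} for r in range(n)]
    # The chunk each rank will forward next; initially its own.
    sending = list(range(n))
    steps = 0
    for _ in range(n - 1):
        # Rank r receives whatever its left neighbour (r - 1) is forwarding.
        received = [sending[(r - 1) % n] for r in range(n)]
        for r in range(n):
            have[r].add(received[r])
        # In the next step every rank forwards the chunk it just received.
        sending = received
        steps += 1
    return have, steps


def recursive_doubling_all_gather(n):
    """Simulate recursive doubling; n must be a power of two (assumption of this sketch)."""
    assert n > 0 and n & (n - 1) == 0, "this sketch only handles power-of-two rank counts"
    have = [{r} for r in range(n)]
    steps = 0
    dist = 1
    while dist < n:
        # Each rank swaps everything it holds with the partner whose rank differs in one bit.
        have = [have[r] | have[r ^ dist] for r in range(n)]
        steps += 1
        dist *= 2
    return have, steps


if __name__ == "__main__":
    for n in (8, 1024):
        _, ring_steps = ring_all_gather(n)
        _, rd_steps = recursive_doubling_all_gather(n)
        print(f"n={n}: ring steps={ring_steps}, recursive doubling steps={rd_steps}")
```

In this model the ring needs n - 1 sequential steps while recursive doubling needs log2(n), which is the gap the question points at for ~1000 ranks.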

sjeaugey commented 10 months ago

There are several other algorithms which you can easily implement on top of ncclSend/ncclRecv:

The main issue with those algorithms is that they require a perfect network fabric to deliver the expected theoretical gain, which means a non-blocking network and perfect adaptive routing for network traffic. They also create a higher number of network connections, which can be a good or a bad thing.
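As a rough illustration of the connection-count point, the Python sketch below compares how many distinct peers a rank talks to in a ring versus recursive doubling. Recursive doubling is used only as a representative non-ring scheme, which is an assumption of this example; the helper names are hypothetical and nothing here reflects NCCL internals.

```python
def ring_peers(rank, n):
    """Peers a rank exchanges data with in a ring: its two neighbours."""
    return {(rank + 1) % n, (rank - 1) % n} - {rank}


def recursive_doubling_peers(rank, n):
    """Peers in recursive doubling: one partner per step, at distances 1, 2, 4, ..."""
    assert n > 0 and n & (n - 1) == 0, "this sketch only handles power-of-two rank counts"
    peers = set()
    dist = 1
    while dist < n:
        peers.add(rank ^ dist)
        dist *= 2
    return peers


if __name__ == "__main__":
    n = 1024
    print("ring peers per rank:", len(ring_peers(0, n)))
    print("recursive doubling peers per rank:", len(recursive_doubling_peers(0, n)))
```

The far partners in recursive doubling are also the ones most likely to sit across the upper layers of the fabric, which is where the quality of routing starts to matter.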

In comparison, current NCCL algorithms are very gentle on the network fabric: only a limited number of flows go to the top layers, which limits the traffic and allows for maximum performance even on systems with reduced spine-level bandwidth.

So while we are thinking about adding those algorithms, enabling them automatically will be a challenge unless we have a reliable way to know how the fabric will behave. Given there is no such thing at the moment, we'll have to rely on users enabling those features manually.