NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Group launch is not efficient when there are a large number of communication collectives #725

Open szhengac opened 2 years ago

szhengac commented 2 years ago

Hi, I just tested the efficiency of training only the biases in a large language model. In our system, we bucket the gradients until the total number of gradients exceeds a certain threshold. Once the bucket is full, we perform a group launch of a number of reduce-scatters to synchronize the gradients. However, I found it is pretty slow when only the biases in the model are learned. After digging into the issue, I found the bucket holds around 980 vectors totaling around 21MB, which means the group launch has to handle 980 reduce-scatter collective calls simultaneously. Nsight profiling shows that this call is pretty slow, whereas concatenating all the vectors and performing a single reduce-scatter is much faster.

Group launch time: 116s
A single reduce-scatter time: 1.6s
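For reference, a minimal sketch of the grouped launch pattern described above (the buffer arrays, counts, communicator, and stream are hypothetical placeholders, not code from our system):

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Hypothetical sketch: issue one reduce-scatter per bucketed gradient
 * vector inside a single NCCL group. */
void bucket_reduce_scatter(const float **sendbufs, float **recvbufs,
                           const size_t *recvcounts, int nvecs,
                           ncclComm_t comm, cudaStream_t stream) {
  /* All ~980 small reduce-scatters are launched together, but each one
   * still runs as its own collective internally. */
  ncclGroupStart();
  for (int i = 0; i < nvecs; i++) {
    ncclReduceScatter(sendbufs[i], recvbufs[i], recvcounts[i],
                      ncclFloat, ncclSum, comm, stream);
  }
  ncclGroupEnd();
}
```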

szhengac commented 2 years ago

My NCCL and CUDA versions: NCCL version 2.13.4+cuda11.7

jbachan commented 2 years ago

Given the current implementation, it is expected that a single concatenated 20MB reduce-scatter would significantly outperform ~1000 2K ones. Reduce-scatter is implemented only with the ring algorithm, which is optimized for bandwidth but has poor latency, so a group of ~1000 small calls pays that per-collective latency ~1000 times.

For now, you are encouraged to concatenate your reduce-scatters.
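A hedged sketch of that workaround, assuming the caller has already packed all gradient vectors into one contiguous device buffer whose length is a multiple of the communicator size (note the result layout differs from per-vector reduce-scatters: each rank receives one contiguous shard of the packed buffer, so the packing order must account for that):

```c
#include <cuda_runtime.h>
#include <nccl.h>

/* Hypothetical sketch: one large reduce-scatter over a pre-packed buffer.
 * packed_send/packed_recv and total_count are placeholder names. */
void fused_reduce_scatter(const float *packed_send, float *packed_recv,
                          size_t total_count, int nranks,
                          ncclComm_t comm, cudaStream_t stream) {
  /* A single large collective amortizes the per-call ring latency that
   * dominates when ~1000 tiny reduce-scatters are grouped. */
  ncclReduceScatter(packed_send, packed_recv, total_count / nranks,
                    ncclFloat, ncclSum, comm, stream);
}
```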

szhengac commented 2 years ago

Do we have any benchmark numbers for different numbers of collectives in a group? We still prefer group launch in some cases because it avoids an additional memory copy.

jbachan commented 2 years ago

The -m <num-per-group> option to NCCL tests can measure the perf for you.
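For example, assuming the nccl-tests binaries are built under ./build, something like the following should measure ~980 aggregated small reduce-scatters per iteration on 8 GPUs (the byte sizes here are illustrative; see the nccl-tests documentation for the exact flag semantics):

```sh
# -g: GPUs, -b/-e: min/max size, -m: ops aggregated per iteration
./build/reduce_scatter_perf -g 8 -b 8K -e 8K -m 980
```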

szhengac commented 2 years ago

Thanks @jbachan, I will give it a try.