szhengac opened this issue 2 years ago
My NCCL and CUDA versions: NCCL 2.13.4 + CUDA 11.7
Given the current implementation, it is expected that a single concatenated 20MB reduce_scatter would significantly outperform ~1000 2KB ones. Reduce-scatter is implemented only with the ring algorithm, which is optimized for bandwidth but has poor latency.
For now, you are encouraged to concatenate your reduce_scatters.
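For illustration, here is a minimal sketch of the concatenated approach (the function name and the float/sum parameters are assumptions, not something prescribed by NCCL). The key detail is the layout: for reduce-scatter, the flat send buffer must consist of `nranks` contiguous chunks, where chunk `r` concatenates rank `r`'s slice of every tensor.

```c
#include <nccl.h>
#include <cuda_runtime.h>

// One large reduce-scatter over a pre-packed flat buffer. A user-provided
// CUDA copy kernel (not shown) must first arrange flat_send so that chunk r
// holds rank r's slice of every gradient tensor; each rank then receives its
// full shard in a single call.
void fused_reduce_scatter(const float* flat_send, float* flat_recv,
                          size_t total_count, int nranks,
                          ncclComm_t comm, cudaStream_t stream) {
  ncclReduceScatter(flat_send, flat_recv, total_count / nranks,
                    ncclFloat, ncclSum, comm, stream);
}
```

The extra device-to-device copy for packing is usually far cheaper than paying the ring algorithm's per-operation latency ~1000 times.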
Do we have any benchmark numbers for different numbers of collectives in a group? We still prefer group launch in some cases where an additional memory copy can be avoided.
The `-m <num-per-group>` option to the NCCL tests can measure the perf for you.
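For example, assuming the nccl-tests binaries are built, something like `./build/reduce_scatter_perf -b 2K -e 2K -g 8 -m 980` should approximate a group of 980 2KB reduce-scatters, and running again with a single ~20MB size and `-m 1` gives the concatenated baseline (the path and GPU count here are illustrative).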
Thanks @jbachan. I will give it a try.
Hi, I just tested the efficiency of training only the biases of a large language model. In our system, we bucket the gradients until the total number of gradients exceeds a certain threshold. Once the bucket is full, we perform a group launch of a number of reduce-scatter calls to synchronize the gradients. However, I found this is quite slow when only the biases of the model are learned. After digging into the issue, I found the bucket holds around 980 vectors with a total size of around 21MB, which means the group launch has to handle 980 reduce-scatter collective calls at once. Nsight profiling shows that this grouped call is quite slow, whereas if I concatenate all the vectors and perform a single reduce-scatter, it is much faster.
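For reference, the group launch pattern in question looks roughly like this (a sketch; the pointer arrays and the FP32/sum parameters are assumptions about our setup):

```c
#include <nccl.h>
#include <cuda_runtime.h>

// Group launch: many small reduce-scatters issued inside one NCCL group, so
// they are launched together. Each operation still runs the ring algorithm,
// whose per-call latency dominates at ~2KB per operation.
void grouped_reduce_scatter(float** send, float** recv, size_t* count,
                            int num_ops, int nranks,
                            ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  for (int i = 0; i < num_ops; i++) {
    // count[i] is the total element count of tensor i across all ranks.
    ncclReduceScatter(send[i], recv[i], count[i] / nranks,
                      ncclFloat, ncclSum, comm, stream);
  }
  ncclGroupEnd();
}
```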
Group launch time: 116s
A single reduce-scatter time: 1.6s