OpenMOSS / CoLLiE

Collaborative Training of Large Language Models in an Efficient Way
https://openlmlab-collie.readthedocs.io
Apache License 2.0
410 stars 58 forks

LOMO optimizer: does gradient clipping double training time? #150

Closed Jieni05 closed 9 months ago

Jieni05 commented 9 months ago

With identical training settings, I tried both clip_grad_norm = None and clip_grad_norm = 5.0, and found that one epoch with the latter takes nearly twice as long as with the former. Is this normal?

KaiLv69 commented 9 months ago

Hi, this is normal. See Section 3.3.1 of https://arxiv.org/pdf/2306.09782.pdf: the full grad norm factor cannot be obtained during the first backward pass, so gradient clipping requires two backward passes. In AdaLomo we improved on this with a grouped update norm, which achieves stable training without enabling grad clipping (see Section 3.2 of https://arxiv.org/pdf/2310.10195.pdf).
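To illustrate why clipping costs a second backward pass: LOMO fuses the parameter update into backward, consuming and freeing each gradient as soon as it is produced, so the global gradient norm is only known after backward finishes. A minimal pure-Python sketch (toy scalar "parameters" and a hypothetical gradient function standing in for backward; not the actual CoLLiE implementation):

```python
import math

# Toy "model": each parameter is a scalar; backward_grads yields gradients
# one at a time, mimicking gradients becoming available layer by layer
# during backward. grad_i = 2 * p_i is an arbitrary toy choice.
def backward_grads(params):
    return [2.0 * p for p in params]

params = [1.0, -2.0, 0.5]
lr, max_norm, eps = 0.1, 5.0, 1e-6

# Backward pass 1: only accumulate the squared norm. Each gradient is
# discarded right after use (fused update style), so the total norm is
# not available until the pass completes.
sq_sum = sum(g * g for g in backward_grads(params))
total_norm = math.sqrt(sq_sum)

# Backward pass 2: recompute each gradient, scale it by the global clip
# factor, and apply the fused SGD-style update immediately.
clip = min(1.0, max_norm / (total_norm + eps))
params = [p - lr * clip * g for p, g in zip(params, backward_grads(params))]
print(params)

# AdaLomo-style alternative (sketch): bound each parameter group's update
# by its *own* norm during a single backward pass, so no global factor
# (and no second backward) is needed. Here each scalar is its own "group"
# and group_max is a hypothetical per-group threshold.
params2 = [1.0, -2.0, 0.5]
group_max = 1.0
updated = []
for p, g in zip(params2, backward_grads(params2)):
    group_clip = min(1.0, group_max / (abs(g) + eps))  # per-group norm bound
    updated.append(p - lr * group_clip * g)
params2 = updated
print(params2)
```

The doubled wall-clock time in the question follows directly: with clip_grad_norm set, every optimization step runs backward twice, and backward dominates the step time.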

Jieni05 commented 9 months ago

Got it, thanks for the explanation.