Using ZeRO stage 3 consumes more GPU memory than stage 2 and stage 1, which is strange

Flywolfs commented 1 year ago

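In theory, ZeRO stage 3 should use the least GPU memory and be the slowest. In my tests, however, stage 3 used the most GPU memory and was also the slowest.

Models tested: BLOOM 560M and BLOOM 1.1B, with a training batch size of 1.

The data is belleMath.json, released by BELLE.

GPU: 2 × TITAN RTX 24GB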

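Test results (GPU memory is listed as GPU 0 + GPU 1):

ZeRO Stage=1:

BLOOM 560M, max length 512, training GPU memory: 12.6GB + 10.9GB, training speed: 8.5 batch/s

BLOOM 560M, max length 1024, training GPU memory: 12.6GB + 11.2GB, training speed: 4 batch/s

BLOOM 1.1B, max length 512, training GPU memory: 20.7GB + 19.4GB, training speed: 6 batch/s

ZeRO Stage=2:

BLOOM 560M, max length 512, training GPU memory: 12GB + 10.9GB, training speed: 6.8 batch/s

BLOOM 560M, max length 1024, training GPU memory: 12GB + 11.4GB, training speed: 3 batch/s

BLOOM 1.1B, max length 512, training GPU memory: 21.5GB + 19.2GB, training speed: 4 batch/s

ZeRO Stage=3:

BLOOM 560M, max length 512, training GPU memory: 12.8GB + 12.9GB, training speed: 2.6 batch/s

BLOOM 560M, max length 1024, training GPU memory: 12.8GB + 12.8GB, training speed: 2.6 batch/s

BLOOM 1.1B, max length 512, training GPU memory: 21.7GB + 21.7GB, training speed: 2.37 batch/s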

Stage 3 clearly uses more GPU memory than stages 1 and 2, and it also trains the slowest. Why is that?

hulkliu77 commented 1 year ago

Did you keep batch_size fixed across these three stages?

shishijier commented 1 year ago

I ran into the same problem. Have you solved it?

hulkliu77 commented 1 year ago

Not solved, but my analysis suggests that ZeRO-3 in DeepSpeed generates so much communication that peak GPU memory ends up too high.
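For context on how much memory partitioning could save at this scale, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam (the thread does not state the optimizer or precision), uses the commonly cited 2 + 2 + 12 bytes per parameter for fp16 weights, fp16 gradients, and fp32 optimizer states, and takes nominal parameter counts of 0.56e9 and 1.1e9. The helper `zero_model_state_gb` is hypothetical, not a DeepSpeed API, and it ignores activations, communication buffers, and CUDA context.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimated per-GPU model-state memory in GB (1e9 bytes).

    Assumption: mixed-precision Adam, i.e. 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, 12 bytes/param for fp32 master
    weights plus Adam momentum and variance.
    """
    fp16_params = 2 * num_params
    fp16_grads = 2 * num_params
    optim_states = 12 * num_params

    if stage >= 1:  # stage 1 partitions optimizer states
        optim_states /= num_gpus
    if stage >= 2:  # stage 2 also partitions gradients
        fp16_grads /= num_gpus
    if stage >= 3:  # stage 3 also partitions parameters
        fp16_params /= num_gpus

    return (fp16_params + fp16_grads + optim_states) / 1e9


if __name__ == "__main__":
    # Nominal parameter counts; the real checkpoints differ slightly.
    for name, n in [("BLOOM 560M", 0.56e9), ("BLOOM 1.1B", 1.1e9)]:
        for stage in (1, 2, 3):
            gb = zero_model_state_gb(n, num_gpus=2, stage=stage)
            print(f"{name}, stage {stage}: ~{gb:.2f} GB per GPU")
```

Under these assumptions, moving from stage 1 to stage 3 on two GPUs saves only about 1.1 GB per GPU for the 560M model and about 2.2 GB for the 1.1B model. That is small enough that stage 3's extra working buffers could plausibly outweigh the savings, which would be consistent with the measurements reported above.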

Junyiliu0 commented 10 months ago

Has anyone solved this problem?

LiChao-cy commented 1 month ago

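In my tests, setting the buffers smaller made stage 3 use the least GPU memory, but it was still the slowest. Making the buffers larger did not seem to change the speed much either.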
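The comment above does not say which settings were changed. As a minimal sketch, assuming "buffers" refers to the ZeRO-3 bucket and live-parameter limits in the DeepSpeed config, a smaller-buffer configuration might look like the following. The numeric values are illustrative guesses, not tuned or tested settings, and a real config would also need optimizer and precision sections.

```python
import json

# Illustrative ZeRO-3 settings with reduced buffer sizes.
# Assumption: these are the knobs meant by "buffers"; values are not tuned.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # the thread uses a training batch of 1
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Number of elements reduced at a time during gradient reduction.
        "reduce_bucket_size": 50_000_000,
        # Size of the buffer used to prefetch parameters.
        "stage3_prefetch_bucket_size": 5_000_000,
        # Parameters smaller than this stay unpartitioned on every GPU.
        "stage3_param_persistence_threshold": 10_000,
        # Upper bound on gathered parameters kept resident per GPU.
        "stage3_max_live_parameters": 100_000_000,
        # Parameters reused within this distance are not released.
        "stage3_max_reuse_distance": 100_000_000,
    },
}

config_text = json.dumps(ds_config, indent=2)

if __name__ == "__main__":
    # Save the printed JSON to a file and pass it to the training script.
    print(config_text)
```

One possible reading of the original measurements: if the live-parameter and reuse-distance limits are left at values larger than the model itself (560M and 1.1B parameters here), stage 3 may keep most gathered parameters resident, so it saves little memory while still paying for the extra communication. This is a hypothesis, not something the thread confirms.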

LianjiaTech / BELLE

Using ZeRO stage 3 consumes more GPU memory than stage 2 and stage 1, which is strange #402