bigcode-project / Megatron-LM

Ongoing research training transformer models at scale

Benchmarking Memory Consumption of Optimizers: Adam vs. Adan #20

Open SivilTaram opened 1 year ago

SivilTaram commented 1 year ago

Benchmarking Results

The memory benchmark was run over the following configurations, recording the peak GPU memory for each optimizer:

| Heads | Layers | Emb. Dim | Model Size (MB) | Adam Peak (MB) | Adan Peak (MB) | $\Delta$ (%) |
|---|---|---|---|---|---|---|
| 6 | 6 | 768 | 81 | 4490 | 4490 | 0 |
| 12 | 6 | 768 | 81 | 5848 | 5848 | 0 |
| 16 | 6 | 768 | 81 | 6776 | 6776 | 0 |
| 6 | 12 | 768 | 124 | 7151 | 7153 | 0.03 |
| 12 | 12 | 768 | 124 | 9869 | 9871 | 0.02 |
| 16 | 12 | 768 | 124 | 11733 | 11735 | 0.02 |
| 16 | 6 | 1024 | 128 | 7302 | 7304 | 0.03 |
| 16 | 12 | 1024 | 203 | 12719 | 12721 | 0.02 |
| 6 | 24 | 768 | 209 | 12471 | 12475 | 0.03 |
| 12 | 24 | 768 | 209 | 17922 | 17922 | 0 |
| 16 | 24 | 768 | 209 | 21596 | 21600 | 0.02 |
| 6 | 6 | 1536 | 248 | 6905 | 8241 | 19.35 |
| 12 | 6 | 1536 | 248 | 8235 | 8539 | 3.69 |
| 16 | 6 | 1536 | 248 | 9141 | 9445 | 3.33 |
| 16 | 24 | 1024 | 354 | 23530 | 23534 | 0.02 |
| 16 | 6 | 2048 | 407 | 11098 | 12159 | 9.56 |
| 6 | 12 | 1536 | 418 | 11137 | 13778 | 23.71 |
| 12 | 12 | 1536 | 418 | 13390 | 14164 | 5.78 |
| 16 | 12 | 1536 | 418 | 15667 | 15976 | 1.97 |
| 16 | 6 | 2560 | 603 | 13967 | 18207 | 30.36 |
| 16 | 12 | 2048 | 709 | 18851 | 20954 | 11.16 |
| 6 | 24 | 1536 | 758 | 19660 | 24819 | 26.24 |
| 12 | 24 | 1536 | 758 | 25096 | 25406 | 1.24 |
| 16 | 24 | 1536 | 758 | 28720 | 29030 | 1.08 |
| 16 | 12 | 2560 | 1075 | 28475 | 32134 | 12.85 |
| 16 | 24 | 2048 | 1313 | 34357 | 38595 | 12.34 |
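
Here $\Delta$ is the relative increase of Adan's peak memory over Adam's. For example, for the 6-head, 6-layer, 1536-dim model:

$$
\Delta = \frac{\text{Adan peak} - \text{Adam peak}}{\text{Adam peak}} \times 100 = \frac{8241 - 6905}{6905} \times 100 \approx 19.35\%
$$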

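The same kind of peak-memory comparison can be reproduced outside Megatron-LM with PyTorch's CUDA allocator statistics. Below is a minimal sketch of the measurement idea only, not the benchmark script behind the table; the small `nn.TransformerEncoder` stack is a stand-in for the GPT configs above, and the `adan_pytorch` import is an assumption to be replaced with whichever Adan implementation you use.

```python
import torch
import torch.nn as nn

def peak_memory_mb(optimizer_cls, steps=3, layers=6, heads=6, emb=768):
    """Peak allocated CUDA memory (MB) over a few training steps."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    # Stand-in for the GPT stacks in the table (Heads / Layers / Emb. Dim knobs).
    layer = nn.TransformerEncoderLayer(d_model=emb, nhead=heads, batch_first=True)
    model = nn.TransformerEncoder(layer, num_layers=layers).cuda()
    opt = optimizer_cls(model.parameters(), lr=1e-4)
    for _ in range(steps):
        x = torch.randn(8, 512, emb, device="cuda")  # (batch, seq, emb)
        loss = model(x).pow(2).mean()                # dummy objective
        opt.zero_grad()
        loss.backward()
        opt.step()  # optimizer state is allocated lazily on the first step
    return torch.cuda.max_memory_allocated() / 2**20

print(f"Adam peak: {peak_memory_mb(torch.optim.Adam):.0f} MB")
# from adan_pytorch import Adan  # hypothetical import; substitute your Adan impl.
# print(f"Adan peak: {peak_memory_mb(Adan):.0f} MB")
```

Since optimizer state is only materialized on the first `step()`, the loop must run at least one full forward/backward/step cycle for the peak to reflect the optimizer's extra buffers.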
Conclusion