HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU-Training
BSD 2-Clause "Simplified" License

Momentum Quantization #14

Closed: ClashLuke closed this issue 2 years ago

ClashLuke commented 2 years ago

Many modern optimisers, such as Shampoo, SM3 and 8-Bit Adam, quantise their large momentum buffers to a lower precision such as int8. This quantisation gives them significant memory savings, as they need only 6 bytes per parameter instead of 12. Adding momentum quantisation could cut our total memory consumption by 16% to 33%, allowing for more parameters and bigger batches.

This issue is about implementing quantised momentum and benchmarking its impact on convergence compared to bf16 and fp32 momentum.
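For illustration, a minimal sketch of what int8 momentum could look like in JAX. It is not the implementation that closed this issue: it uses plain per-tensor absmax scaling, which is simpler than the block-wise scheme of 8-Bit Adam, and the names `quantise`, `dequantise` and `momentum_step` are hypothetical. The 6-versus-12-byte figure presumably assumes fp32 parameters (4 bytes) plus two optimiser buffers per parameter, stored either as fp32 (2 × 4 bytes) or as int8 (2 × 1 byte).

```python
import jax
import jax.numpy as jnp


def quantise(x):
    """Symmetric per-tensor absmax quantisation of a float array to int8."""
    scale = jnp.max(jnp.abs(x)) / 127.0
    # An all-zero buffer would give scale == 0; fall back to 1 to avoid 0/0.
    scale = jnp.where(scale == 0, 1.0, scale)
    q = jnp.clip(jnp.round(x / scale), -127, 127).astype(jnp.int8)
    return q, scale.astype(jnp.float32)


def dequantise(q, scale):
    """Map the int8 codes back to float32."""
    return q.astype(jnp.float32) * scale


@jax.jit
def momentum_step(q, scale, grad, beta=0.9):
    """One exponential-moving-average update on an int8-stored momentum buffer.

    The buffer is dequantised, updated in float32 and quantised again, so only
    the int8 codes and one float32 scale per tensor persist between steps.
    """
    momentum = dequantise(q, scale)
    momentum = beta * momentum + (1 - beta) * grad
    return quantise(momentum)


# Hypothetical usage: start from a zero buffer and apply one gradient.
grad = jnp.linspace(-1.0, 1.0, 1024)
q, scale = quantise(jnp.zeros_like(grad))
q, scale = momentum_step(q, scale, grad)
```

With round-to-nearest, the error introduced per element is at most half the scale. Whether that error, accumulated over many steps, affects convergence is the benchmarking question raised above, and this sketch does not answer it.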

ClashLuke commented 2 years ago

Closed together with #15 by #34