bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index

Stochastic rounding support for 8-bit optimizers #1165

Open drhead opened 4 months ago

drhead commented 4 months ago

Feature request

It would be extremely helpful to have stochastic rounding support for 8-bit optimizers, as described in Revisiting BFloat16 Training. This would require computing the parameter update in full precision while keeping the model parameters in half precision, then randomly rounding each updated parameter up or down with probability determined by the quantization error. If implemented correctly, this should add minimal overhead and should perform comparably to full-precision training.
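For illustration, here is a minimal sketch of the usual bit-manipulation trick for stochastic rounding from FP32 to BF16, written in plain PyTorch. `stochastic_round_to_bf16` is a hypothetical helper, not part of bitsandbytes, and it assumes finite FP32 inputs:

```python
import torch

def stochastic_round_to_bf16(x_fp32: torch.Tensor) -> torch.Tensor:
    # Reinterpret the FP32 bits as int32. BF16 keeps the top 16 bits of an
    # FP32 value; the low 16 bits are what round-to-nearest normally discards.
    bits = x_fp32.contiguous().view(torch.int32)
    # Add uniform noise in [0, 2^16) to the low bits, then truncate them.
    # The closer a value is to the next representable BF16 number, the more
    # likely the addition carries into the high bits, i.e. rounds "up".
    noise = torch.randint(0, 1 << 16, bits.shape, dtype=torch.int32, device=bits.device)
    rounded = (bits + noise) & ~0xFFFF  # clear the low 16 bits
    # The result has a zero low half, so the conversion to BF16 is exact.
    return rounded.view(torch.float32).to(torch.bfloat16)
```

A real implementation would presumably have to do this inside the fused 8-bit optimizer kernels rather than in Python, since the full-precision updated value is never materialized outside the kernel.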

Motivation

This would allow even more memory savings: the full set of model parameters plus the 8-bit optimizer state could fit in the same space as a single full-precision copy of the parameters, or less than that when using 8-bit Lion.

Your contribution

Unfortunately, I don't know CUDA, and my understanding is that it would be required to implement this, since the optimizer's gradient transforms and parameter updates are all handled outside of Python in one fused, inseparable step.

SirTrippsalot commented 3 months ago

I'd love to see this added. Stochastic rounding greatly improves BF16 training results with other optimizers.