bitsandbytes-foundation / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index

Stochastic rounding support for 8-bit optimizers #1165

Open drhead opened 4 months ago

drhead commented 4 months ago

Feature request

It would be extremely helpful to have stochastic rounding support for 8-bit optimizers, as described in Revisiting BFloat16 Training. This would require computing the parameter update in full precision while keeping the model parameters in half precision, then randomly rounding each updated parameter up or down with probability determined by the quantization error. If implemented correctly, this should add minimal overhead and should perform comparably to full-precision training.
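For illustration, here is a minimal sketch of the usual bit-manipulation trick for stochastic rounding from FP32 to BF16, written in plain PyTorch. `stochastic_round_to_bf16` is a hypothetical helper, not part of bitsandbytes, and it assumes finite FP32 inputs:

```python
import torch

def stochastic_round_to_bf16(x_fp32: torch.Tensor) -> torch.Tensor:
    # Reinterpret the FP32 bits as int32. BF16 keeps the top 16 bits of an
    # FP32 value; the low 16 bits are what round-to-nearest normally discards.
    bits = x_fp32.contiguous().view(torch.int32)
    # Add uniform noise in [0, 2^16) to the low bits, then truncate them.
    # The closer a value is to the next representable BF16 number, the more
    # likely the addition carries into the high bits, i.e. rounds "up".
    noise = torch.randint(0, 1 << 16, bits.shape, dtype=torch.int32, device=bits.device)
    rounded = (bits + noise) & ~0xFFFF  # clear the low 16 bits
    # The result has a zero low half, so the conversion to BF16 is exact.
    return rounded.view(torch.float32).to(torch.bfloat16)
```

A real implementation would presumably have to do this inside the fused 8-bit optimizer kernels rather than in Python, since the full-precision updated value is never materialized outside the kernel.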

Motivation

This would allow even more memory savings: the full set of model parameters plus the 8-bit optimizer state could fit in the same space as a single full-precision copy of the parameters, or less than that when using 8-bit Lion.

Your contribution

Unfortunately, I don't know CUDA, and my understanding is that it would be required to implement this, since the optimizer's gradient transforms and parameter updates are all handled outside of Python in one fused, inseparable step.

SirTrippsalot commented 3 months ago

I'd love to see this added. Stochastic rounding greatly improves BF16 training results with other optimizers.