If I can get RootPainter working with an 8-bit optimiser, it could reduce memory requirements and speed up training.
See https://arxiv.org/pdf/2303.10181.pdf, which states: "The use of 8-bit optimiser reduces the GPU memory utilised and the convergence time. The more interesting observation is that in almost all cases (except ViT), it also converges to a better solution, yielding a small performance improvement."
One concern is training stability. They also note: "One reason for the degradation in performance in transformers when using the 8-bit optimiser could be instability during training."
See:
Dettmers, T., Lewis, M., Shleifer, S., Zettlemoyer, L.: 8-bit optimizers via
block-wise quantization. In: International Conference on Learning
Representations (2022), https://openreview.net/forum?id=shpkpVXzo3h
https://github.com/TimDettmers/bitsandbytes
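A minimal sketch of what the swap might look like, assuming RootPainter's training loop uses a standard PyTorch optimiser (the Conv2d stand-in model and learning rate here are hypothetical, not RootPainter's actual architecture or settings). bitsandbytes advertises its 8-bit optimisers as drop-in replacements for their torch.optim counterparts, so in principle only the optimiser construction line changes:

```python
# Sketch: replacing a 32-bit PyTorch optimiser with bitsandbytes' 8-bit
# AdamW. Assumes bitsandbytes and torch are installed; bitsandbytes
# itself also requires a CUDA-capable GPU at training time.
try:
    import torch
    import bitsandbytes as bnb

    # Stand-in for the real segmentation network (hypothetical).
    model = torch.nn.Conv2d(3, 16, kernel_size=3)

    # Before: optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # After: optimiser state is stored in 8 bits via block-wise
    # quantisation (Dettmers et al., 2022).
    optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)
except ImportError:
    # Libraries not available in this environment.
    optimizer = None
```

On the stability concern: Dettmers et al. attribute much of the transformer instability to embedding layers and provide a stable embedding layer in bitsandbytes to mitigate it. A U-Net-style segmentation model has no embedding layers, so the instability they report for ViT may be less of a risk here, though that would need to be confirmed empirically.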