If I can get RootPainter working with an 8-bit optimiser, it could reduce memory requirements and speed up training.
See https://arxiv.org/pdf/2303.10181.pdf, which states: "The use of 8-bit optimiser reduces the GPU memory utilised and the convergence time. The more interesting observation is that in almost all cases (except ViT), it also converges to a better solution, yielding a small performance improvement."
One concern is training stability. They also note: "One reason for the degradation in performance in transformers when using the 8-bit optimiser could be instability during training."
See:
Dettmers, T., Lewis, M., Shleifer, S., Zettlemoyer, L.: 8-bit optimizers via
block-wise quantization. In: International Conference on Learning
Representations (2022), https://openreview.net/forum?id=shpkpVXzo3h
https://github.com/TimDettmers/bitsandbytes
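A minimal sketch of what the swap might look like, assuming RootPainter's training loop uses a standard PyTorch optimiser (the Conv2d stand-in model and learning rate here are hypothetical, not RootPainter's actual architecture or settings). bitsandbytes advertises its 8-bit optimisers as drop-in replacements for their torch.optim counterparts, so in principle only the optimiser construction line changes:

```python
# Sketch: replacing a 32-bit PyTorch optimiser with bitsandbytes' 8-bit
# AdamW. Assumes bitsandbytes and torch are installed; bitsandbytes
# itself also requires a CUDA-capable GPU at training time.
try:
    import torch
    import bitsandbytes as bnb

    # Stand-in for the real segmentation network (hypothetical).
    model = torch.nn.Conv2d(3, 16, kernel_size=3)

    # Before: optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # After: optimiser state is stored in 8 bits via block-wise
    # quantisation (Dettmers et al., 2022).
    optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)
except ImportError:
    # Libraries not available in this environment.
    optimizer = None
```

On the stability concern: Dettmers et al. attribute much of the transformer instability to embedding layers and provide a stable embedding layer in bitsandbytes to mitigate it. A U-Net-style segmentation model has no embedding layers, so the instability they report for ViT may be less of a risk here, though that would need to be confirmed empirically.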