🚀 Feature
HF Transformers implements 8-bit and 4-bit quantization. It would be nice if that feature could be leveraged for the xlm-r-xxl machine translation evaluation model.
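For reference, a minimal sketch of what the Transformers side looks like. The facebook/xlm-roberta-xxl checkpoint name is my assumption for the underlying encoder; the eval model's own loading path may differ:

```python
# Minimal sketch of HF Transformers' built-in quantization (bitsandbytes).
# facebook/xlm-roberta-xxl is my assumption for the underlying encoder.
import torch
from transformers import AutoModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # or load_in_8bit=True
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype for 4-bit matmuls
)

model = AutoModel.from_pretrained(
    "facebook/xlm-roberta-xxl",
    quantization_config=quant_config,
    device_map="auto",  # bitsandbytes weights need to live on GPU
)
```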
Motivation
The large xlm-r-xxl model is too big for most commodity GPUs. To broaden access to top-performing translation evaluation, please implement a quantized version.
Alternatives
I have seen a few libraries that quantize BERT-style models outside the HF ecosystem.
Additional context
I tried to load the big model in 8-bit with HF, without automatic device mapping. I could load the model, which then used 14 GB of VRAM, but I don't know how to use it for evaluation.
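For completeness, this is roughly what I ran, followed by a hypothetical forward pass. The mean pooling at the end is a placeholder I made up, since I don't know how the eval model actually consumes the encoder outputs:

```python
# Roughly my attempt: 8-bit load, then a forward pass. The pooling at the
# end is a hypothetical placeholder, not the eval model's scoring head.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

name = "facebook/xlm-roberta-xxl"  # assumed underlying encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(
    name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    # no device_map here, matching my attempt; the model lands on the default GPU
)

inputs = tokenizer(
    "Das ist ein Test.",  # source
    "This is a test.",    # hypothesis
    return_tensors="pt",
    truncation=True,
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Placeholder mean pooling over tokens; the real eval model presumably
# feeds these hidden states into its own regression head.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```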