🚀 Feature
HF Transformers implements 8-bit and 4-bit quantization. It would be nice if that feature could be leveraged for the xlm-r-xxl machine translation evaluation model.
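For reference, a minimal sketch of what the Transformers side looks like. The facebook/xlm-roberta-xxl checkpoint name is my assumption for the underlying encoder; the eval model's own loading path may differ:

```python
# Minimal sketch of HF Transformers' built-in quantization (bitsandbytes).
# facebook/xlm-roberta-xxl is my assumption for the underlying encoder.
import torch
from transformers import AutoModel, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # or load_in_8bit=True
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype for 4-bit matmuls
)

model = AutoModel.from_pretrained(
    "facebook/xlm-roberta-xxl",
    quantization_config=quant_config,
    device_map="auto",  # bitsandbytes weights need to live on GPU
)
```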
Motivation
The large xlm-r-xxl model is too big for most commodity GPUs. To broaden access to top-performing translation evaluation, please implement a quantized version.
Alternatives
I have seen a few libraries that quantize BERT-style models outside the HF ecosystem.
Additional context
I tried to load the big model in 8-bit with HF, without automatic device mapping. I could load the model, which then used 14 GB of VRAM, but I don't know how to use it for evaluation.
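For completeness, this is roughly what I ran, followed by a hypothetical forward pass. The mean pooling at the end is a placeholder I made up, since I don't know how the eval model actually consumes the encoder outputs:

```python
# Roughly my attempt: 8-bit load, then a forward pass. The pooling at the
# end is a hypothetical placeholder, not the eval model's scoring head.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

name = "facebook/xlm-roberta-xxl"  # assumed underlying encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(
    name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    # no device_map here, matching my attempt; the model lands on the default GPU
)

inputs = tokenizer(
    "Das ist ein Test.",  # source
    "This is a test.",    # hypothesis
    return_tensors="pt",
    truncation=True,
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Placeholder mean pooling over tokens; the real eval model presumably
# feeds these hidden states into its own regression head.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```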