Unbabel / COMET

A Neural Framework for MT Evaluation
https://unbabel.github.io/COMET/html/index.html
Apache License 2.0

Quantization #164

Open KnutJaegersberg opened 9 months ago

KnutJaegersberg commented 9 months ago

🚀 Feature

HF transformers implements 8-bit and 4-bit quantization. It would be nice if that feature could be leveraged for the xlm-r-xxl machine translation evaluation model.
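For illustration, a minimal sketch of what this could look like with transformers' `BitsAndBytesConfig`, assuming the underlying encoder is `facebook/xlm-roberta-xxl` (the COMET wrapper and its regression head are not shown):

```python
# Hedged sketch: load the XLM-R XXL encoder in 8-bit via bitsandbytes.
# Assumes `pip install transformers accelerate bitsandbytes` and a CUDA GPU.
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # or load_in_4bit=True

tokenizer = AutoTokenizer.from_pretrained("facebook/xlm-roberta-xxl")
model = AutoModel.from_pretrained(
    "facebook/xlm-roberta-xxl",
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place the quantized weights
)
```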

Motivation

The xlm-r-xxl model is too large for most commodity GPUs. To broaden access to top-performing translation evaluation, please provide a quantized version.

Alternatives

I have seen a few libraries that quantize BERT-style models outside the HF ecosystem.

Additional context

I tried to load the big model in 8-bit with HF transformers, without automatic device mapping. The model loaded and used about 14 GB of VRAM, but I don't know how to use it for evaluation afterwards.
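In case it helps, a hedged sketch of running a plain forward pass through such an 8-bit encoder, continuing from the loading sketch above. This only produces hidden states; COMET's regression head would still need to be applied on top, which is not shown here:

```python
import torch

# Encode a toy sentence with the 8-bit encoder loaded earlier.
inputs = tokenizer(
    "The cat sat on the mat.",  # placeholder example text
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# [batch, seq_len, hidden] token embeddings from the quantized encoder.
hidden_states = outputs.last_hidden_state
print(hidden_states.shape)
```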

ricardorei commented 9 months ago

Loading in 8-bit and using FlashAttention would be great enhancements. There is a good example of RoBERTa with FlashAttention.
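A hedged sketch of what the attention-kernel switch could look like at load time. Whether `flash_attention_2` is available for XLM-RoBERTa depends on the installed `transformers` and `flash-attn` versions, so that part is an assumption; `sdpa` (PyTorch's scaled dot-product attention) is the more broadly supported fallback shown here:

```python
import torch
from transformers import AutoModel

# Hedged sketch: request a faster attention implementation.
# "sdpa" is widely supported in recent transformers versions;
# "flash_attention_2" additionally requires flash-attn to be installed
# and the architecture to support it.
model = AutoModel.from_pretrained(
    "facebook/xlm-roberta-xxl",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
```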

ricardorei commented 9 months ago

This also connects to @BramVanroy's suggestion to use BetterTransformer (#117).
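For reference, a minimal sketch of the BetterTransformer path through the `optimum` library, under the assumption that the encoder is a standard HF XLM-R model; exact support depends on the installed versions:

```python
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModel

# Hedged sketch: swap the encoder layers for BetterTransformer's fused
# kernels, as suggested in #117. Requires `pip install optimum`.
model = AutoModel.from_pretrained("xlm-roberta-base")  # small model for demo
model = BetterTransformer.transform(model)
```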