Vahe1994 / AQLM

Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/pdf/2401.06118.pdf) and "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (https://arxiv.org/abs/2405.14852)
Apache License 2.0

Feature request: Quantize Google Gemma models #29

Closed: younesbelkada closed this issue 7 months ago

younesbelkada commented 8 months ago

Hi authors!

With the recent AQLM integration in transformers, would it make sense to quantize the Google Gemma models to 2-bit?

The list of models can be found here: https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b

cc @BlackSamorez
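
As context for the integration mentioned above: once the `aqlm` package is installed, AQLM checkpoints load through the standard `transformers` API. A minimal sketch, assuming an AQLM checkpoint is available on the Hub (the repo ID below is a placeholder, not a real checkpoint name):

```python
# Requires the AQLM inference kernels: pip install aqlm[gpu]
from transformers import AutoModelForCausalLM

# Placeholder repo ID; substitute a real AQLM checkpoint from the Hub.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-model-AQLM-2Bit-1x16-hf",
    torch_dtype="auto",   # keep the dtype stored in the checkpoint
    device_map="auto",    # dispatch layers to available GPU(s)
)
```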

Godofnothing commented 8 months ago

Hi, @younesbelkada. We intend to prepare 2-bit Gemma models and will get back to you once we have results.

younesbelkada commented 8 months ago

Thanks a lot @Godofnothing! Looking forward to it!

Godofnothing commented 7 months ago

Hi, @younesbelkada. We have uploaded the quantized Gemma 2B versions to the Hugging Face Hub.

For some reason, the 7B Gemma model experiences a significant decline in performance after quantization, making it unusable. If we manage to resolve the issue, we will upload the 7B models as well.
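
The exact model links did not survive in this thread, so the repo ID below is an assumption based on the usual ISTA-DASLab AQLM naming scheme; a hedged end-to-end generation sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo ID following the common AQLM naming convention; verify on the Hub.
repo_id = "ISTA-DASLab/gemma-2b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype="auto", device_map="auto"
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```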

younesbelkada commented 7 months ago

Very nice, thank you @Godofnothing!