casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Gemma2 Support #529

Open yc-wang00 opened 4 days ago

yc-wang00 commented 4 days ago

Hi team, I am opening this issue to request support for the Google Gemma 2 models.

Recently, Google released two new models: google/gemma-2-27b and google/gemma-2-9b. As an initial trial, we ran these new models through the existing Gemma path, but it didn't work as expected. Specifically, when I tried to quantize google/gemma-2-9b, the quantized model just produced nonsense outputs.
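
For reference, here is roughly what I ran (a minimal sketch using the standard 4-bit quant config from the README; the output directory name is arbitrary):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "google/gemma-2-9b"
quant_path = "gemma-2-9b-awq"  # arbitrary output directory

# Standard 4-bit AWQ config
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the unquantized model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration and quantize the weights
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```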

Could someone please investigate and add support for Gemma 2?

Thank you very much!!!

casper-hansen commented 2 days ago

I made an initial attempt that did not work: https://github.com/casper-hansen/AutoAWQ/compare/main...gemma2. Unfortunately, I do not have enough time at the moment to do further research on how to support the new architecture.

The biggest change I see for quantizing the model is that each decoder layer now has a pre-feedforward and a post-feedforward layernorm, so there is some challenge in mapping the AWQ scaling onto them correctly (see the sketch below). Maybe @TechxGenus or someone else can help contribute.
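
For anyone picking this up, here is an untested sketch of what the scaling definition in a hypothetical `awq/models/gemma2.py` might look like, following the pattern of the existing model files. The layernorm attribute names come from the transformers `Gemma2DecoderLayer`; treat this as a starting point, not working code:

```python
from awq.models.base import BaseAWQForCausalLM


class Gemma2AWQForCausalLM(BaseAWQForCausalLM):
    layer_type = "Gemma2DecoderLayer"
    max_seq_len_key = "max_position_embeddings"

    @staticmethod
    def get_model_layers(model):
        return model.model.layers

    @staticmethod
    def get_act_for_scaling(module):
        return dict(is_scalable=False)

    @staticmethod
    def move_embed(model, device):
        model.model.embed_tokens = model.model.embed_tokens.to(device)

    @staticmethod
    def get_layers_for_scaling(module, input_feat, module_kwargs):
        layers = []

        # Attention input: scaled against input_layernorm, as in Gemma/Llama.
        layers.append(dict(
            prev_op=module.input_layernorm,
            layers=[
                module.self_attn.q_proj,
                module.self_attn.k_proj,
                module.self_attn.v_proj,
            ],
            inp=input_feat["self_attn.q_proj"],
            module2inspect=module.self_attn,
            kwargs=module_kwargs,
        ))

        # MLP input: unlike Gemma, the MLP is fed by pre_feedforward_layernorm
        # rather than post_attention_layernorm, so that must be the prev_op.
        layers.append(dict(
            prev_op=module.pre_feedforward_layernorm,
            layers=[module.mlp.gate_proj, module.mlp.up_proj],
            inp=input_feat["mlp.gate_proj"],
            module2inspect=module.mlp,
        ))

        # MLP output: scale down_proj against the up_proj activation.
        layers.append(dict(
            prev_op=module.mlp.up_proj,
            layers=[module.mlp.down_proj],
            inp=input_feat["mlp.down_proj"],
        ))

        # Note: post_feedforward_layernorm (and post_attention_layernorm,
        # which Gemma2 applies to the attention *output*) is not the input
        # to any quantized linear here, which is exactly the open question.
        return layers
```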

TechxGenus commented 2 days ago

There are still many open issues with Gemma 2 community support (e.g. the logits soft cap, fp16 numerical issues, and the sliding window attention). I suggest waiting for them all to be resolved.
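
To illustrate one of those: Gemma 2 squashes both the attention scores and the final logits through a tanh soft cap, which any inference path (including fused AWQ kernels) has to reproduce. A minimal sketch; the cap values 50.0 and 30.0 are the attn_logit_softcapping and final_logit_softcapping values from the released config:

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Gemma 2-style soft capping: smoothly bounds values to (-cap, cap).
    return cap * torch.tanh(logits / cap)

# Released Gemma 2 config: attn_logit_softcapping=50.0, final_logit_softcapping=30.0
x = torch.randn(2, 8) * 100            # stand-in logits with large magnitude
print(soft_cap(x, 50.0).abs().max())   # bounded below 50, unlike the raw logits
```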