NetEase-FuXi / EETQ

Easy and Efficient Quantization for Transformers
Apache License 2.0

Repetition with Llama3-70b and EETQ #22

Open mjsteele12 opened 1 month ago

mjsteele12 commented 1 month ago

First of all, thank you for EETQ!

I am using EETQ with TGI to serve Llama3-70b-instruct. I have noticed that, compared to other quants (bnb-nf4, AWQ), the repetition I get from Llama3 is significantly higher (all other generation parameters being the same). I am doing a lot of structured extraction with TGI's grammar feature, e.g. topic classification/extraction. With EETQ, the responses I get may look like this:

["Math", "Science", "Reading", "Math", "Science", "Reading", "Math", "Science", "Reading", "Math", "Science", "Reading"]

With other quants, I get something like the expected: ["Math", "Science", "Reading"]

For all quants I'm using a repetition_penalty of 1.1, temperature 0.1, and top_p 0.95, but as stated I'm only observing this with EETQ. I have no idea how to debug this, or whether it is even possible, but the repetition issue holds across prompts and inputs, so I wanted to share. I'm using the TGI Docker container and the official Llama3-70b-instruct weights (for bnb-nf4).
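A minimal sketch of a request like the ones described above, against TGI's `/generate` endpoint (the URL, prompt, and grammar schema are illustrative placeholders; only the sampling parameter values come from this report):

```python
# Sketch of a grammar-constrained TGI request with the sampling settings
# described above. Endpoint URL, prompt, and JSON schema are placeholders.
import requests

TGI_URL = "http://localhost:8080/generate"  # placeholder address

payload = {
    "inputs": "Extract the topics mentioned in the following text: ...",
    "parameters": {
        "max_new_tokens": 128,
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.1,
        # TGI guidance: constrain the output to a JSON array of strings.
        "grammar": {
            "type": "json",
            "value": {"type": "array", "items": {"type": "string"}},
        },
    },
}

response = requests.post(TGI_URL, json=payload, timeout=120)
print(response.json()["generated_text"])
```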

I'm wondering if anyone else has come across this or has any insights.

dtlzhuangz commented 1 month ago

The problem may be caused by precision degradation from quantization. To verify this, please use logger.info in TGI to print the logits produced with EETQ and compare them with those of the original (unquantized) model. My guess is that the logits of ']' and ',' are very close, so after EETQ quantization the logit of ',' ends up larger than that of ']'.
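A rough sketch of such a logit comparison outside of TGI (this assumes the transformers EetqConfig integration; the model id and prompt are placeholders, and a smaller checkpoint is more practical than loading the 70B model twice):

```python
# Sketch: compare next-token logits between the FP16 model and an
# EETQ-quantized copy on the same prompt, focusing on the ']' vs ','
# tokens suspected of swapping order after quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, EetqConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder; a smaller model also works
tokenizer = AutoTokenizer.from_pretrained(model_id)

fp16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
eetq_model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=EetqConfig("int8"), device_map="auto"
)

# A prompt where the next token should be ']' rather than ','.
prompt = '["Math", "Science", "Reading"'


def next_token_logits(model, text):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        return model(**inputs).logits[0, -1]


fp16_logits = next_token_logits(fp16_model, prompt)
eetq_logits = next_token_logits(eetq_model, prompt)

for tok in ["]", ","]:
    tok_id = tokenizer.encode(tok, add_special_tokens=False)[0]
    print(f"{tok!r}: fp16={fp16_logits[tok_id].item():.3f} "
          f"eetq={eetq_logits[tok_id].item():.3f}")
```

If the two logits are nearly tied in FP16 and flip order under EETQ, that would support the precision-degradation explanation.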