Closed fancyerii closed 7 months ago
This is unexpected, as testing on Llama models shows minimal accuracy drops.
I tried to use autoawq directly:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,
    device_map="auto",
    pad_token_id=tokenizer.eos_token_id,
)
```
I still got 0.08 accuracy.
Hi @fancyerii, it seems the model you quantized did not turn out well. It's a different story for the official llama 2 70b models, so I am unfortunately not able to help you out with this issue.
I used AWQ to quantize Llama 2 70b-chat with this script (quantize_llama.py):
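The contents of quantize_llama.py are not shown in the thread. As a point of reference, a minimal AutoAWQ 0.1.x quantization script typically looks like the sketch below; the paths and every value in `quant_config` are illustrative assumptions, not the reporter's actual settings:

```python
# Hypothetical sketch of an AutoAWQ 0.1.x quantization script.
# Paths and config values are illustrative, not the reporter's settings.
quant_config = {
    "zero_point": True,   # asymmetric (zero-point) quantization
    "q_group_size": 128,  # group size for the 4-bit weights
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # kernel variant used at inference time
}

def quantize(model_path: str, quant_path: str) -> None:
    # Heavy imports are deferred so the sketch can be read without
    # a GPU environment installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Calibrate on AutoAWQ's default calibration set and quantize in place.
    model.quantize(tokenizer, quant_config=quant_config)

    # Persist the quantized checkpoint next to its tokenizer files.
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
```

One thing worth checking in a case like this: AutoAWQ calibrates on a small generic text dataset by default, and a calibration set that is far from a chat model's input distribution is one plausible contributor to a large accuracy drop.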
And I serve it with:
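The serving command is also omitted from the thread. One common way to serve an AWQ checkpoint (an assumption here, since the thread does not name the serving stack) is vLLM's OpenAI-compatible server:

```shell
# Hypothetical serving command; the model path and parallelism flag are
# illustrative. --quantization awq makes vLLM load the AWQ kernels.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/llama-2-70b-chat-awq \
    --quantization awq \
    --tensor-parallel-size 2
```

If the accuracy numbers come from a server rather than from AutoAWQ directly, evaluating the quantized checkpoint with AutoAWQ itself helps isolate whether the quantization or the serving path is at fault.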
On my test data, the original llama 2 70b-chat got 0.581 accuracy, but the AWQ-compressed model only got 0.094. What's wrong?

My system info:
autoawq 0.1.8
transformers 4.36.1
torch 2.1.2