casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

awq compression of llama 2 70b chat got bad result #292

Closed fancyerii closed 7 months ago

fancyerii commented 8 months ago

I used AWQ to quantize Llama 2 70B-chat with:

CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" python quantize_llama.py

The code in quantize_llama.py:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/nas/lili/models_hf/70B-chat'
quant_path = '/nas/lili/models_hf/70B-chat-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

And I serve it with:

CUDA_VISIBLE_DEVICES=0,1 python api_server.py --model /nas/lili/models_hf/70B-chat-awq/ --port 8005  --tensor-parallel-size=2
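
Assuming api_server.py here is vLLM's simple API server (which exposes a POST /generate endpoint), a quick smoke test of the served model could look like the following sketch; the prompt and sampling parameters are illustrative only:

import requests

# Send one request to the vLLM simple API server started above (port 8005).
# Endpoint and JSON fields follow vllm.entrypoints.api_server; adjust them
# if api_server.py is a custom script.
response = requests.post(
    "http://localhost:8005/generate",
    json={
        "prompt": "[INST] What is the capital of France? [/INST]",
        "max_tokens": 64,
        "temperature": 0.0,
    },
)
print(response.json()["text"])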

On my test data, the original Llama 2 70B-chat got 0.581 accuracy, but the AWQ-compressed model only got 0.094. What's wrong? My system info: autoawq 0.1.8, transformers 4.36.1, torch 2.1.2.

casper-hansen commented 8 months ago

This is unexpected, as testing on Llama models shows minimal accuracy drops.

  1. Did you run your tests using vLLM and is it possible to reproduce the low accuracy in AutoAWQ?
  2. Did you try with a custom dataset for quantization? (See the sketch below.)
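
For reference, recent AutoAWQ versions accept custom calibration text in quantize() via the calib_data argument (a dataset name or a list of strings); when it is omitted, the library's built-in default calibration set is used. Relative to the quantization script above, only the quantize call would change; the sample texts below are placeholders, not a recommended set:

# Illustrative, domain-relevant calibration samples (placeholders only).
calib_data = [
    "A user asks an assistant to summarize a news article about renewable energy.",
    "Explain the difference between supervised and unsupervised learning.",
    # ... a few hundred representative texts from the evaluation domain
]

# calib_data may be a dataset name or a list of strings; if not passed,
# AutoAWQ falls back to its default calibration dataset.
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
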

fancyerii commented 8 months ago

  1. Yes, I ran my tests using vLLM. I will try using AutoAWQ directly.
  2. I just used the code above. I don't know what the default dataset is.
fancyerii commented 8 months ago

I tried using AutoAWQ directly:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,
    device_map="auto",
    pad_token_id=tokenizer.eos_token_id,
)

It still got 0.08 accuracy.
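
A quick way to eyeball whether the quantized checkpoint itself is broken (independent of any accuracy harness) is a short greedy generation with the model loaded above; the prompt and settings here are illustrative, and model.generate is assumed to pass through to the usual transformers generate:

# Minimal qualitative sanity check of the loaded quantized model
# (prompt and generation settings are illustrative only).
prompt = "[INST] Name three primary colors. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))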

casper-hansen commented 7 months ago

Hi @fancyerii, it seems the model you quantized did not turn out well. It's a different story for the official Llama 2 70B models, so I am unfortunately not able to help you out with this issue.