casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

awq compression of llama 2 70b chat got bad result #292

Closed fancyerii closed 7 months ago

fancyerii commented 8 months ago

I used AWQ to quantize Llama 2 70B-chat with:

CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" python quantize_llama.py

The code in quantize_llama.py:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/nas/lili/models_hf/70B-chat'
quant_path = '/nas/lili/models_hf/70B-chat-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

And I serve it with:

CUDA_VISIBLE_DEVICES=0,1 python api_server.py --model /nas/lili/models_hf/70B-chat-awq/ --port 8005  --tensor-parallel-size=2
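
Assuming api_server.py here is vLLM's simple API server (which exposes a POST /generate endpoint), a quick smoke test of the served model could look like the following sketch; the prompt and sampling parameters are illustrative only:

import requests

# Send one request to the vLLM simple API server started above (port 8005).
# Endpoint and JSON fields follow vllm.entrypoints.api_server; adjust them
# if api_server.py is a custom script.
response = requests.post(
    "http://localhost:8005/generate",
    json={
        "prompt": "[INST] What is the capital of France? [/INST]",
        "max_tokens": 64,
        "temperature": 0.0,
    },
)
print(response.json()["text"])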

On my test data, the original Llama 2 70B-chat got 0.581 accuracy, but the AWQ-compressed model only got 0.094. What's wrong? My system info: autoawq 0.1.8, transformers 4.36.1, torch 2.1.2.

casper-hansen commented 8 months ago

This is unexpected, as testing on Llama models shows minimal accuracy drops.

  1. Did you run your tests using vLLM and is it possible to reproduce the low accuracy in AutoAWQ?
  2. Did you try with a custom dataset for quantization? (See the sketch below.)
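
For reference, recent AutoAWQ versions accept custom calibration text in quantize() via the calib_data argument (a dataset name or a list of strings); when it is omitted, the library's built-in default calibration set is used. Relative to the quantization script above, only the quantize call would change; the sample texts below are placeholders, not a recommended set:

# Illustrative, domain-relevant calibration samples (placeholders only).
calib_data = [
    "A user asks an assistant to summarize a news article about renewable energy.",
    "Explain the difference between supervised and unsupervised learning.",
    # ... a few hundred representative texts from the evaluation domain
]

# calib_data may be a dataset name or a list of strings; if not passed,
# AutoAWQ falls back to its default calibration dataset.
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
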

fancyerii commented 8 months ago

  1. Yes, I ran my tests using vLLM. I will try using AutoAWQ directly.
  2. I just used the code above. I don't know what the default dataset is.
fancyerii commented 8 months ago

I tried using AutoAWQ directly:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,
    device_map="auto",
    pad_token_id=tokenizer.eos_token_id,
)

It still got 0.08 accuracy.
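
A quick way to eyeball whether the quantized checkpoint itself is broken (independent of any accuracy harness) is a short greedy generation with the model loaded above; the prompt and settings here are illustrative, and model.generate is assumed to pass through to the usual transformers generate:

# Minimal qualitative sanity check of the loaded quantized model
# (prompt and generation settings are illustrative only).
prompt = "[INST] Name three primary colors. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))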

casper-hansen commented 7 months ago

Hi @fancyerii, it seems the model you quantized did not turn out well. It's a different story for the official Llama 2 70B models, so I am unfortunately not able to help you out with this issue.