casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

awq quantization is not fully optimized yet. The speed can be slower than non-quantized models #545

Open jackNhat opened 1 month ago

jackNhat commented 1 month ago

When I ran the quantization code for llama3-70b-instruct, it completed successfully, but when I used vLLM to load the quantized model, I got a warning: "awq quantization is not fully optimized yet. The speed can be slower than non-quantized models."

Does that affect the processing speed of this model?

This is my code:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'meta-llama/Meta-Llama-3-70B-Instruct'

quant_path = 'Meta-Llama-3-70B-Instruct-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Environment: vllm==0.4.3, vllm-flash-attn==2.5.8.post2, nccl==2.20.5
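For readers unfamiliar with the fields in `quant_config` above, here is a minimal sketch of what they describe. This is illustrative NumPy code, not AutoAWQ's internal implementation: `w_bit=4` maps weights to integers in [0, 15], `q_group_size=128` means each group of 128 weights shares one scale, and `zero_point=True` selects asymmetric quantization with a per-group zero point.

```python
import numpy as np

def quantize_group(w, n_bits=4):
    """Asymmetric quantization of one weight group (a sketch of the
    scheme the AWQ config fields describe, not AutoAWQ's own code)."""
    qmax = 2 ** n_bits - 1                       # 15 for 4-bit weights
    scale = (w.max() - w.min()) / qmax           # one scale per group
    zero = np.round(-w.min() / scale)            # one zero point per group
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return q.astype(np.uint8), scale, zero

def dequantize_group(q, scale, zero):
    # Recover an approximation of the original float weights
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)      # one group of 128 weights
q, scale, zero = quantize_group(w)
w_hat = dequantize_group(q, scale, zero)

assert 0 <= q.min() and q.max() <= 15            # stays in the 4-bit range
assert np.abs(w - w_hat).max() <= scale          # error bounded by one step
```

The real kernels additionally pack eight 4-bit values per int32 and apply AWQ's activation-aware scaling before quantizing, which is where the quality and speed benefits come from.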

casper-hansen commented 1 month ago

Hi @jackNhat, AWQ models are currently underoptimized in vLLM. The good news is that the main branch has a new optimization that enables up to 2.59x more performance; it should be released in vllm==0.5.3 in the coming days.

jackNhat commented 1 month ago

Many thanks, I am really looking forward to it.