casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

[Performance degradation] phi-3-medium-128k-instruct outputs repetitively after AWQ quantization #507

Open Ross-Fan opened 2 weeks ago

Ross-Fan commented 2 weeks ago

phi-3-medium-128k-instruct was quantized with AutoAWQ using the following quant config:

quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

Nothing else was changed in the quantize.py file.

Then I ran the generation script as follows:

from awq import AutoAWQForCausalLM
from awq.utils.utils import get_best_device
from transformers import AutoTokenizer, TextStreamer
import argparse
parser = argparse.ArgumentParser()

parser.add_argument('--quant_path',type=str, help='The Quantized Model Path')
parser.add_argument('--prompt',type=str, help='Prompt for generator')
args = parser.parse_args()

quant_path = args.quant_path

# Load model
if get_best_device() == "cpu":
    model = AutoAWQForCausalLM.from_quantized(quant_path, use_qbits=True, fuse_layers=False)
else:
    model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=False)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# prompt = "You're standing on the surface of the Earth. "\
#         "You walk one mile south, one mile west and one mile north. "\
#         "You end up exactly where you started. Where are you?"
prompt = args.prompt

chat = [
    {"role": "system", "content": "You are a concise assistant that helps answer questions."},
    {"role": "user", "content": prompt},
]

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|endoftext|>"),
    tokenizer.convert_tokens_to_ids("<|end|>"),
    tokenizer.convert_tokens_to_ids("<|assistant|>"),
]

tokens = tokenizer.apply_chat_template(
    chat,
    return_tensors="pt"
)
tokens = tokens.to(get_best_device())

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=1024,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.2
)

print(generation_output)

I set fuse_layers=False because otherwise the model cannot be loaded on the GPU (A100 40GB).
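(As an aside, the out-of-memory problem with fused layers may just be the fused modules pre-allocating their static cache for the full 128k context. If the installed AutoAWQ version supports a max_seq_len argument on from_quantized, which is an assumption here, capping it might let fuse_layers=True fit, for example:)

# Assumption: this AutoAWQ version accepts max_seq_len in from_quantized,
# which limits the length the fused modules pre-allocate their cache for.
model = AutoAWQForCausalLM.from_quantized(
    quant_path,
    fuse_layers=True,
    max_seq_len=4096,  # size the fused cache for 4k tokens instead of 128k
)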

Running python3 quan-phi3-inference2.py --quant_path ./phi-3-128k-medium-autoawq --prompt "tell me some advice when I workout", the output becomes repetitive:

[Screenshot 2024-06-18 16:06:47 showing the repetitive output]

Any tips for this issue?

HelloCard commented 2 weeks ago

Same problem here, using vLLM with kaitchup/Phi-3-medium-4k-instruct-awq-4bit.

badrjd commented 2 weeks ago

Same issue. I pushed the quantized model here: https://huggingface.co/bjaidi/Phi-3-medium-128k-instruct-awq

Tested with vLLM 0.4.2; also compared against GPTQ on vLLM, which worked well: https://huggingface.co/Rakuto/Phi-3-medium-4k-instruct-gptq-4bit
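For anyone trying to reproduce this, a minimal vLLM sketch along these lines should work (the prompt, max_model_len, and sampling parameters below are illustrative placeholders, not the exact settings used):

from vllm import LLM, SamplingParams

# Load the AWQ checkpoint in vLLM; max_model_len is kept small for a quick test.
llm = LLM(
    model="bjaidi/Phi-3-medium-128k-instruct-awq",
    quantization="awq",
    trust_remote_code=True,
    max_model_len=4096,
)
params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=256)

outputs = llm.generate(["Tell me some advice for when I work out."], params)
print(outputs[0].outputs[0].text)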