QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

[BUG] Why is Qwen-72B-Chat-Int4 inference much slower than Qwen-72B-Chat? #882

Closed: vipcong816 closed this issue 9 months ago

vipcong816 commented 9 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Using the cli_demo.py script.

Compared with Qwen-72B-Chat, inference with Qwen-72B-Chat-Int4 is much slower. Qwen-72B-Chat runs fast, but after switching to the Qwen-72B-Chat-Int4 model, inference becomes extremely slow. Does anyone know what is going on?

You can test this with the code below; this version runs fast:

from transformers import AutoModelForCausalLM, AutoTokenizer
import datetime

tokenizer = AutoTokenizer.from_pretrained("Qwen-72B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen-72B-Chat", device_map="auto", trust_remote_code=True).eval()

start_time = datetime.datetime.now()
for response in model.chat_stream(tokenizer, "什么东北菜好吃", history=None):
    print(f"\nQwen-Chat: {response}")
end_time = datetime.datetime.now()
execution_time = (end_time - start_time).total_seconds() * 1000
print("Execution Time:", execution_time, "ms")

Replacing the model with Qwen-72B-Chat-Int4 makes it very slow:

from transformers import AutoModelForCausalLM, AutoTokenizer
import datetime

tokenizer = AutoTokenizer.from_pretrained("Qwen-72B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen-72B-Chat-Int4", device_map="auto", trust_remote_code=True).eval()

start_time = datetime.datetime.now()
for response in model.chat_stream(tokenizer, "什么东北菜好吃", history=None):
    print(f"\nQwen-Chat: {response}")
end_time = datetime.datetime.now()
execution_time = (end_time - start_time).total_seconds() * 1000
print("Execution Time:", execution_time, "ms")
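
As a side note on the measurement itself: the two runs may generate answers of different lengths, so wall-clock milliseconds alone can be misleading. A fairer comparison is tokens per second. Below is a minimal sketch reusing the setup above, assuming (as in Qwen's remote code) that chat_stream yields the accumulated response text on each iteration:

start_time = datetime.datetime.now()
final_response = ""
for response in model.chat_stream(tokenizer, "什么东北菜好吃", history=None):
    final_response = response  # each yield is assumed to be the full response so far
elapsed = (datetime.datetime.now() - start_time).total_seconds()

# Normalize by output length so a longer answer does not look "slower".
num_tokens = len(tokenizer.encode(final_response))
print(f"{num_tokens} tokens in {elapsed:.1f}s -> {num_tokens / elapsed:.2f} tokens/s")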

I saw something like this on the official site; not sure whether it is a problem with the int4 model. (screenshot attachment: 2A87EE0F-F503-4374-999C-F607A91D7273)

Expected Behavior

Hoping to find the cause of the problem.

Steps To Reproduce

from transformers import AutoModelForCausalLM, AutoTokenizer
import datetime

tokenizer = AutoTokenizer.from_pretrained("Qwen-72B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen-72B-Chat-Int4", device_map="auto", trust_remote_code=True).eval()

start_time = datetime.datetime.now()
for response in model.chat_stream(tokenizer, "什么东北菜好吃", history=None):
    print(f"\nQwen-Chat: {response}")
end_time = datetime.datetime.now()
execution_time = (end_time - start_time).total_seconds() * 1000
print("Execution Time:", execution_time, "ms")

Environment

PyTorch: 2.0.1

Anything else?

jklj077 commented 9 months ago

How many GPUs? Which auto-gptq version? For a 72B model running on the same number of GPUs, slower int4 inference is expected.

vipcong816 commented 9 months ago

How many GPUs? Which auto-gptq version? For a 72B model running on the same number of GPUs, slower int4 inference is expected.

Four GPUs, auto-gptq 0.6.0.
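
For others hitting the same question, a small diagnostic sketch to collect the details asked about above; the package names are assumed to match the PyPI distributions and are not taken from this thread:

import importlib.metadata
import torch

# Environment details the maintainer asked about: GPU count and auto-gptq version.
print("GPUs visible:", torch.cuda.device_count())
print("auto-gptq:", importlib.metadata.version("auto-gptq"))

# After loading with device_map="auto", the layer placement across GPUs can be
# inspected; a heavily unbalanced map can also slow down generation.
# print(model.hf_device_map)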

jklj077 commented 9 months ago

As I understand it, this is still within expectations. If you need faster inference speed and higher throughput, we recommend deploying with FastChat + vLLM; vLLM now officially supports GPTQ quantization as well.
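
For reference, a minimal vLLM sketch along those lines; the quantization and tensor-parallel options shown are assumptions about a recent vLLM version and a 4-GPU setup, not something confirmed in this thread:

from vllm import LLM, SamplingParams

# Load the GPTQ-quantized checkpoint and shard it across 4 GPUs.
llm = LLM(
    model="Qwen-72B-Chat-Int4",
    quantization="gptq",
    tensor_parallel_size=4,
    trust_remote_code=True,
)

# Note: a chat checkpoint normally expects the Qwen chat prompt format; serving it
# behind FastChat (or another OpenAI-compatible server) handles that templating.
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["什么东北菜好吃"], sampling_params)
print(outputs[0].outputs[0].text)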