QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

[BUG] Why is Qwen-72B-Chat-Int4 inference much slower than Qwen-72B-Chat? #882

Closed: vipcong816 closed this issue 9 months ago

vipcong816 commented 9 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Using the cli_demo.py script.

Compared with Qwen-72B-Chat, inference with Qwen-72B-Chat-Int4 is much slower. Qwen-72B-Chat runs fast, but after switching to the Qwen-72B-Chat-Int4 model, inference becomes extremely slow. Does anyone know what is going on?

You can test this with the code below; this version runs fast:

from transformers import AutoModelForCausalLM, AutoTokenizer
import datetime

tokenizer = AutoTokenizer.from_pretrained("Qwen-72B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen-72B-Chat", device_map="auto", trust_remote_code=True).eval()

start_time = datetime.datetime.now()
for response in model.chat_stream(tokenizer, "什么东北菜好吃", history=None):
    print(f"\nQwen-Chat: {response}")
end_time = datetime.datetime.now()
execution_time = (end_time - start_time).total_seconds() * 1000
print("Execution Time:", execution_time, "ms")

Replacing the model with Qwen-72B-Chat-Int4 makes it very slow:

from transformers import AutoModelForCausalLM, AutoTokenizer
import datetime

tokenizer = AutoTokenizer.from_pretrained("Qwen-72B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen-72B-Chat-Int4", device_map="auto", trust_remote_code=True).eval()

start_time = datetime.datetime.now()
for response in model.chat_stream(tokenizer, "什么东北菜好吃", history=None):
    print(f"\nQwen-Chat: {response}")
end_time = datetime.datetime.now()
execution_time = (end_time - start_time).total_seconds() * 1000
print("Execution Time:", execution_time, "ms")
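
As a side note on the measurement itself: the two runs may generate answers of different lengths, so wall-clock milliseconds alone can be misleading. A fairer comparison is tokens per second. Below is a minimal sketch reusing the setup above, assuming (as in Qwen's remote code) that chat_stream yields the accumulated response text on each iteration:

start_time = datetime.datetime.now()
final_response = ""
for response in model.chat_stream(tokenizer, "什么东北菜好吃", history=None):
    final_response = response  # each yield is assumed to be the full response so far
elapsed = (datetime.datetime.now() - start_time).total_seconds()

# Normalize by output length so a longer answer does not look "slower".
num_tokens = len(tokenizer.encode(final_response))
print(f"{num_tokens} tokens in {elapsed:.1f}s -> {num_tokens / elapsed:.2f} tokens/s")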

I saw something like this on the official site; not sure whether it is a problem with the int4 model. (screenshot attachment: 2A87EE0F-F503-4374-999C-F607A91D7273)

Expected Behavior

Hoping to find the cause of the problem.

Steps To Reproduce

from transformers import AutoModelForCausalLM, AutoTokenizer
import datetime

tokenizer = AutoTokenizer.from_pretrained("Qwen-72B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen-72B-Chat-Int4", device_map="auto", trust_remote_code=True).eval()

start_time = datetime.datetime.now()
for response in model.chat_stream(tokenizer, "什么东北菜好吃", history=None):
    print(f"\nQwen-Chat: {response}")
end_time = datetime.datetime.now()
execution_time = (end_time - start_time).total_seconds() * 1000
print("Execution Time:", execution_time, "ms")

Environment

PyTorch: 2.0.1

Anything else?

jklj077 commented 9 months ago

How many GPUs? Which auto-gptq version? For a 72B model running on the same number of GPUs, slower int4 inference is expected.

vipcong816 commented 9 months ago

How many GPUs? Which auto-gptq version? For a 72B model running on the same number of GPUs, slower int4 inference is expected.

Four GPUs, auto-gptq 0.6.0.
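
For others hitting the same question, a small diagnostic sketch to collect the details asked about above; the package names are assumed to match the PyPI distributions and are not taken from this thread:

import importlib.metadata
import torch

# Environment details the maintainer asked about: GPU count and auto-gptq version.
print("GPUs visible:", torch.cuda.device_count())
print("auto-gptq:", importlib.metadata.version("auto-gptq"))

# After loading with device_map="auto", the layer placement across GPUs can be
# inspected; a heavily unbalanced map can also slow down generation.
# print(model.hf_device_map)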

jklj077 commented 9 months ago

As I understand it, this is still within expectations. If you need faster inference speed and higher throughput, we recommend deploying with FastChat + vLLM; vLLM now officially supports GPTQ quantization as well.
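
For reference, a minimal vLLM sketch along those lines; the quantization and tensor-parallel options shown are assumptions about a recent vLLM version and a 4-GPU setup, not something confirmed in this thread:

from vllm import LLM, SamplingParams

# Load the GPTQ-quantized checkpoint and shard it across 4 GPUs.
llm = LLM(
    model="Qwen-72B-Chat-Int4",
    quantization="gptq",
    tensor_parallel_size=4,
    trust_remote_code=True,
)

# Note: a chat checkpoint normally expects the Qwen chat prompt format; serving it
# behind FastChat (or another OpenAI-compatible server) handles that templating.
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["什么东北菜好吃"], sampling_params)
print(outputs[0].outputs[0].text)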