Closed vipcong816 closed 9 months ago
Using the cli_demo.py script, inference with Qwen-72B-Chat-Int4 is much slower than with Qwen-72B-Chat. Qwen-72B-Chat is fast, but after switching to the Qwen-72B-Chat-Int4 model, inference becomes extremely slow. Does anyone know what is going on?

You can test with the following code, which runs fast:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import datetime

tokenizer = AutoTokenizer.from_pretrained("Qwen-72B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen-72B-Chat", device_map="auto", trust_remote_code=True
).eval()

start_time = datetime.datetime.now()
for response in model.chat_stream(tokenizer, "什么东北菜好吃", history=None):
    print(f"\nQwen-Chat: {response}")
end_time = datetime.datetime.now()
execution_time = (end_time - start_time).total_seconds() * 1000
print("Execution Time:", execution_time, "ms")
```
Replacing it with Qwen-72B-Chat-Int4 makes it very slow:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import datetime

tokenizer = AutoTokenizer.from_pretrained("Qwen-72B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen-72B-Chat-Int4", device_map="auto", trust_remote_code=True
).eval()

start_time = datetime.datetime.now()
for response in model.chat_stream(tokenizer, "什么东北菜好吃", history=None):
    print(f"\nQwen-Chat: {response}")
end_time = datetime.datetime.now()
execution_time = (end_time - start_time).total_seconds() * 1000
print("Execution Time:", execution_time, "ms")
```
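To compare the two models on more than total wall time, a small helper (hypothetical, not part of the Qwen repo) can measure time-to-first-chunk separately from total time for any streaming generator such as `model.chat_stream(...)`:

```python
import time

def benchmark_stream(stream):
    """Time a streaming generator: time-to-first-chunk and total wall time.

    `stream` is any iterable yielding partial responses, e.g.
    model.chat_stream(tokenizer, prompt, history=None).
    """
    start = time.perf_counter()
    ttfc = None  # time to first chunk, in seconds
    chunks = 0
    final = None
    for response in stream:
        if ttfc is None:
            ttfc = time.perf_counter() - start
        chunks += 1
        final = response
    total = time.perf_counter() - start
    return {"chunks": chunks, "ttfc_s": ttfc, "total_s": total, "final": final}

# Stand-in generator for illustration; replace with model.chat_stream(...):
def fake_stream():
    for i in range(3):
        time.sleep(0.01)
        yield "x" * (i + 1)

stats = benchmark_stream(fake_stream())
print(stats["chunks"], stats["final"])  # 3 xxx
```

A much larger time-to-first-chunk for the Int4 model would point at load/warm-up overhead, while uniformly slower chunks point at the per-token dequantization cost.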
I saw something like this mentioned on the official site, and I am not sure whether it is a problem with the int4 model.

Expected behavior: identify the cause of the slowdown.

Environment: PyTorch 2.0.1
How many GPUs, and which auto-gptq version? For the 72B model on the same number of GPUs, slower int4 inference is expected.
Four GPUs, auto-gptq 0.6.0.
I think this is still within expectations. If you need faster inference speed and higher throughput, we recommend deploying with FastChat + vLLM; vLLM now officially supports GPTQ quantization.
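A rough sketch of that deployment (flags are assumptions based on FastChat's vLLM worker and vLLM's engine arguments; verify them against the versions you have installed):

```shell
# 1. Start the FastChat controller
python -m fastchat.serve.controller

# 2. Start a model worker backed by vLLM with GPTQ kernels on 4 GPUs
python -m fastchat.serve.vllm_worker \
    --model-path Qwen-72B-Chat-Int4 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --quantization gptq

# 3. Expose an OpenAI-compatible API
python -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
```

vLLM's GPTQ kernels fuse dequantization into the matmul and batch requests with continuous batching, which is where most of the throughput gain over the plain transformers `chat_stream` path comes from.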