Open hhbb979 opened 6 months ago
Qwen1.5 removed chat and chat_stream. See https://qwen.readthedocs.io/en/latest/inference/chat.html for the new inference API; only llm_stream_generator in bridge_qwen_local.py needs to be modified:
from threading import Thread
from transformers import TextIteratorStreamer
from toolbox import get_conf

device = get_conf('LOCAL_MODEL_DEVICE')
system_prompt = get_conf('INIT_SYS_PROMPT')

def llm_stream_generator(self, **kwargs):
    def adaptor(kwargs):
        query = kwargs['query']
        max_length = kwargs['max_length']
        top_p = kwargs['top_p']
        temperature = kwargs['temperature']
        history = kwargs['history']
        return query, max_length, top_p, temperature, history

    query, max_length, top_p, temperature, history = adaptor(kwargs)

    # Fold prior (query, response) pairs into the message list so the
    # conversation history is not silently dropped.
    messages = [{"role": "system", "content": system_prompt}]
    for q, a in history:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": query})

    # Render the Qwen1.5 chat template and tokenize.
    text = self._tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = self._tokenizer([text], return_tensors="pt").to(device)

    # generate() blocks until completion, so run it on a background thread
    # and stream decoded text through TextIteratorStreamer.
    streamer = TextIteratorStreamer(self._tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=512,
        do_sample=True,
        top_p=top_p,
        temperature=temperature,
    )
    thread = Thread(target=self._model.generate, kwargs=generation_kwargs)
    thread.start()

    response = ""
    for new_text in streamer:
        response += new_text
        yield response
    thread.join()
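Since generate() blocks until the whole completion is finished, it runs on a background thread while TextIteratorStreamer hands decoded text back to the caller; this is the standard streaming pattern in transformers. For context, run() in local_llm_class.py consumes the generator roughly like the hypothetical loop below (the argument values are illustrative, and bridge stands for the loaded handler instance):

# Hypothetical driver, mirroring how local_llm_class.py iterates the
# generator (compare the traceback later in this issue).
for response_full in bridge.llm_stream_generator(
    query="Hello", max_length=512, top_p=0.8, temperature=0.7, history=[]
):
    print(response_full)  # cumulative response so far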
After switching to this code, qwen1.5-14b-chat is extremely slow on a single V100: roughly one character per second of output, with GPU utilization at 100%. Not sure whether the model or the code is to blame. qwen-14b-chat runs quite fast, GPU utilization stays low, and the streaming output is smooth.
Same here: about one character per second, very slow, and the GPU is fully occupied. Strange.
I have tested the 14B, 32B, and 72B variants of qwen1.5; inference through the official transformers library is slow for all of them. I recommend deploying with vllm or llama.cpp instead.
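For anyone taking the vLLM route, a minimal offline-inference sketch is below (the model id and sampling values are illustrative, not taken from this thread; vLLM also ships an OpenAI-compatible API server for deployment):

# Minimal vLLM sketch: vLLM's paged attention and request scheduling are
# usually much faster than a bare transformers generate() loop.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-14B-Chat")
sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)
for output in llm.generate(["Briefly introduce yourself."], sampling):
    print(output.outputs[0].text)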
About 10 seconds per token here, with all 24 GB of VRAM occupied. Both qwen2 and qwen1.5 are slow in my tests, and I don't know why.
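One possible explanation worth checking (an assumption, not confirmed in this thread): Qwen1.5 checkpoints default to bfloat16, which pre-Ampere GPUs such as the V100 do not support natively, so the arithmetic may be falling back to a slow path. Forcing float16 at load time is a cheap experiment:

# Sketch: load the model in float16 instead of the checkpoint's default
# bfloat16 (the model id is illustrative). On a V100 (compute capability
# 7.0, no native bf16) this can make a large speed difference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # avoid bf16 emulation on pre-Ampere GPUs
    device_map="auto",
)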
Installation Method
OneKeyInstall (one-click install script, Windows)
Version
Latest
OS
Windows
Describe the bug
Traceback (most recent call last):
  File ".\request_llms\local_llm_class.py", line 158, in run
    for response_full in self.llm_stream_generator(**kwargs):
  File ".\request_llms\bridge_qwen_local.py", line 46, in llm_stream_generator
    for response in self._model.chat_stream(self._tokenizer, query, history=history):
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\GPT_academic371\Lib\site-packages\torch\nn\modules\module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat_stream'
Screen Shot
Terminal Traceback & Material to Help Reproduce Bugs
No response