binary-husky / gpt_academic

Provides a practical interactive front end for large language models such as GPT and GLM, with particular attention to the paper-reading, polishing, and writing experience. Modular design with custom shortcut buttons and function plugins; analysis and self-translation of Python, C++, and other projects; PDF/LaTeX paper translation and summarization; parallel queries to multiple LLMs; local models such as chatglm3. Integrations include Tongyi Qianwen, deepseekcoder, iFlytek Spark (讯飞星火), Wenxin Yiyan (文心一言), llama2, rwkv, claude2, moss, and more.
https://github.com/binary-husky/gpt_academic/wiki/online
GNU General Public License v3.0
63.46k stars · 7.86k forks

[Bug]: Qwen1.5-14B-chat 运行不了 #1545

Open hhbb979 opened 6 months ago

hhbb979 commented 6 months ago

Installation Method | 安装方法与平台

OneKeyInstall (一键安装脚本-windows)

Version | 版本

Latest | 最新版

OS | 操作系统

Windows

Describe the bug | 简述

Traceback (most recent call last):
  File ".\request_llms\local_llm_class.py", line 158, in run
    for response_full in self.llm_stream_generator(**kwargs):
  File ".\request_llms\bridge_qwen_local.py", line 46, in llm_stream_generator
    for response in self._model.chat_stream(self._tokenizer, query, history=history):
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\GPT_academic371\Lib\site-packages\torch\nn\modules\module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2ForCausalLM' object has no attribute 'chat_stream'

Screen Shot | 有帮助的截图

(Same traceback as above.)

Terminal Traceback & Material to Help Reproduce Bugs | 终端traceback(如有) + 帮助我们复现的测试材料样本(如有)

No response

kaltsit33 commented 5 months ago

Qwen1.5 removed chat and chat_stream. See https://qwen.readthedocs.io/en/latest/inference/chat.html for the new usage; it is enough to modify llm_stream_generator in bridge_qwen_local.py:

device = get_conf('LOCAL_MODEL_DEVICE')
system_prompt = get_conf('INIT_SYS_PROMPT')

def llm_stream_generator(self, **kwargs):
    def adaptor(kwargs):
        query = kwargs['query']
        max_length = kwargs['max_length']
        top_p = kwargs['top_p']
        temperature = kwargs['temperature']
        history = kwargs['history']
        return query, max_length, top_p, temperature, history

    # Note: max_length, top_p, temperature and history are unpacked here but not
    # forwarded to generate() in this minimal version.
    query, max_length, top_p, temperature, history = adaptor(kwargs)

    # Build the prompt with the tokenizer's chat template instead of the removed chat_stream().
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query}
    ]
    text = self._tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = self._tokenizer([text], return_tensors="pt").to(device)

    # Stream decoded text back as it is generated.
    from transformers import TextIteratorStreamer
    streamer = TextIteratorStreamer(self._tokenizer, skip_prompt=True, skip_special_tokens=True)

    # generate() blocks, so run it in a background thread and consume the streamer here.
    from threading import Thread
    generation_kwargs = dict(model_inputs, streamer=streamer, max_new_tokens=512)
    thread = Thread(target=self._model.generate, kwargs=generation_kwargs)
    thread.start()

    response = ""
    for new_text in streamer:
        response += new_text
        yield response
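
The snippet above accepts max_length, top_p, temperature, and history but never forwards them to generate(), so sampling always uses the model defaults, output is capped at 512 new tokens, and earlier turns are dropped. A minimal sketch of the same method with those values forwarded, assuming history is a list of (user, assistant) pairs as the old chat_stream API expected:

def llm_stream_generator(self, **kwargs):
    # Sketch only: forwards the UI's sampling parameters and the chat history.
    query, max_length, top_p, temperature, history = (
        kwargs['query'], kwargs['max_length'], kwargs['top_p'],
        kwargs['temperature'], kwargs['history'])

    # Assumption: history is a list of (user, assistant) pairs, as chat_stream expected.
    messages = [{"role": "system", "content": system_prompt}]
    for user_turn, assistant_turn in history:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": query})

    text = self._tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = self._tokenizer([text], return_tensors="pt").to(device)

    from transformers import TextIteratorStreamer
    from threading import Thread
    streamer = TextIteratorStreamer(self._tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=max_length,   # forward the UI setting instead of a fixed 512
        do_sample=True,
        top_p=top_p,
        temperature=temperature,
    )
    Thread(target=self._model.generate, kwargs=generation_kwargs).start()

    response = ""
    for new_text in streamer:
        response += new_text
        yield response
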
zerotoone01 commented 5 months ago

(Quoting kaltsit33's fix above.)

With qwen1.5-14b-chat and this code on a single V100, generation is extremely slow: roughly one character of output per second with GPU utilization at 100%. I am not sure whether the model or the code is at fault. qwen-14b-chat runs much faster, GPU utilization stays low, and the streaming output is smooth.

ZH-007 commented 4 months ago

(Quoting kaltsit33's fix and zerotoone01's report above.)

Same here: it outputs about one character per second, very slow, and GPU usage is maxed out. Strange.

kaltsit33 commented 4 months ago

I have tested Qwen1.5 at 14B, 32B, and 72B; inference through the official transformers code is slow for all of them. I recommend deploying with vLLM or llama.cpp instead.
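
For reference, a minimal vLLM sketch for a Qwen1.5 chat model (the model id Qwen/Qwen1.5-14B-Chat and the sampling values below are illustrative):

# Illustrative vLLM sketch; model id and sampling values are placeholders.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "Qwen/Qwen1.5-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

# Build the prompt with the chat template, as in bridge_qwen_local.py.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "你好"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512))
print(outputs[0].outputs[0].text)

vLLM also ships an OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-14B-Chat), which can then be queried over HTTP instead of through the local bridge.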

hejian41 commented 2 months ago

About one word every 10 seconds, with all 24 GB of VRAM used up. Both Qwen2 and Qwen1.5 are slow in my tests; no idea why.
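
If it helps narrow this down: with the plain transformers path, symptoms like maxed-out GPU or VRAM and roughly one token per second often point to the weights having been loaded in float32 or partially offloaded to CPU, rather than to the streaming code itself. A minimal check, assuming the standard transformers loading path (the model id and dtype below are illustrative):

# Diagnostic sketch (illustrative model id): confirm where the weights actually live.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-14B-Chat",
    torch_dtype=torch.float16,   # without an explicit dtype, transformers may load in float32
    device_map="auto",           # may silently offload layers to CPU when VRAM runs out
)

# Any float32 parameters or CPU devices here would explain very slow generation.
print({p.dtype for p in model.parameters()})
print({p.device for p in model.parameters()})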