chatchat-space / Langchain-Chatchat

Langchain-Chatchat (formerly Langchain-ChatGLM): RAG and Agent applications built with Langchain and local LLMs such as ChatGLM, Qwen, and Llama.
Apache License 2.0

[BUG] Qwen-14B-Chat conversations are very slow, much slower than chatting through an API started directly with fastchat #2077

Closed · erjiguan closed this issue 9 months ago

erjiguan commented 10 months ago

Problem Description
Qwen-14B-Chat conversations are very slow, much slower than chatting through an API I start directly with fastchat; by rough observation the gap is more than 10x. I have configured all 4 GPUs to be used, and neither setup has flash-attention installed. With fastchat, I launch it like this:

python3 -m fastchat.serve.controller
python3 -m fastchat.serve.vllm_worker --model-path /root/.cache/modelscope/hub/qwen/Qwen-14B-Chat --trust-remote-code --tensor-parallel-size 4

Expected Result
The speed should be the same as chatting through the API started directly with fastchat.

Actual Result

Environment Information

liunux4odoo commented 10 months ago

Try changing the model parameter tensor-parallel-size to 4 in server_config?
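For reference, a rough sketch of where that setting lives. This assumes the 0.2.x config layout with an FSCHAT_MODEL_WORKERS dict in configs/server_config.py; the exact key names (infer_turbo, tensor_parallel_size, gpu_memory_utilization) vary between versions, so check them against your own file rather than copying blindly:

```python
# configs/server_config.py  (sketch; key names are assumptions, verify for your version)
FSCHAT_MODEL_WORKERS = {
    "Qwen-14B-Chat": {
        # use the vLLM backend instead of plain HuggingFace inference
        "infer_turbo": "vllm",
        # shard the 14B model across all 4 GPUs
        "tensor_parallel_size": 4,
        "gpu_memory_utilization": 0.9,
    },
}
```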

Jayzhang1995 commented 10 months ago

@erjiguan How are you passing parameters to the API? Following the example on port 7861, Postman keeps reporting a format error, and running the request from the command line returns nothing; the logs directory has no useful output either.

curl -X 'POST' 'http://127.0.0.1:7861/chat/fastchat' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ "model": "chatglm2-6b", "messages": [ { "role": "user", "content": "hello" } ], "temperature": 0.7, "n": 1, "max_tokens": 0, "stop": [], "stream": false, "presence_penalty": 0, "frequency_penalty": 0 }'

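One thing worth checking if /chat/fastchat keeps rejecting the payload is the OpenAI-compatible server that fastchat itself exposes. A minimal sketch in Python, assuming fastchat's openai_api_server is running on port 20000 (the default port in Langchain-Chatchat's startup scripts); the port and model name are assumptions to adapt to your setup:

```python
import requests

# Assumption: fastchat.serve.openai_api_server is listening on port 20000.
url = "http://127.0.0.1:20000/v1/chat/completions"
payload = {
    "model": "chatglm2-6b",
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 0.7,
    "stream": False,
}
resp = requests.post(url, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```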

erjiguan commented 10 months ago

> Try changing the model parameter tensor-parallel-size to 4 in server_config?

No luck, the speed hasn't changed; it's still very slow.

erjiguan commented 10 months ago

> @erjiguan How are you passing parameters to the API? Following the example on port 7861, Postman keeps reporting a format error, and running the request from the command line returns nothing; the logs directory has no useful output either. curl -X 'POST' 'http://127.0.0.1:7861/chat/fastchat' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "model": "chatglm2-6b", "messages": [ { "role": "user", "content": "hello" } ], "temperature": 0.7, "n": 1, "max_tokens": 0, "stop": [], "stream": false, "presence_penalty": 0, "frequency_penalty": 0 }'


I'm still testing performance and output quality; I haven't tried the API yet.

erjiguan commented 10 months ago

> Try changing the model parameter tensor-parallel-size to 4 in server_config?

It turned out the vLLM inference acceleration framework wasn't enabled. After enabling it there are still problems: first, startup.py does not import worker_id; after I added the import myself, asking a question raises an error:

{'base_url': 'http://127.0.0.1:7864', 'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://127.0.0.1': None, 'http://': None, 'https://': None, 'all://': None}}
{'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://127.0.0.1': None, 'http://': None, 'https://': None, 'all://': None}}
2023-11-16 19:55:42,826 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:20001/list_models "HTTP/1.1 200 OK"
INFO: 127.0.0.1:58920 - "POST /llm_model/list_running_models HTTP/1.1" 200 OK
2023-11-16 19:55:42,828 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:7864/llm_model/list_running_models "HTTP/1.1 200 OK"
INFO: 127.0.0.1:58920 - "POST /llm_model/list_config_models HTTP/1.1" 200 OK
2023-11-16 19:55:42,830 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:7864/llm_model/list_config_models "HTTP/1.1 200 OK"
{'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://127.0.0.1': None, 'http://': None, 'https://': None, 'all://': None}}
2023-11-16 19:55:42,864 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:20001/list_models "HTTP/1.1 200 OK"
INFO: 127.0.0.1:58920 - "POST /llm_model/list_running_models HTTP/1.1" 200 OK
2023-11-16 19:55:42,866 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:7864/llm_model/list_running_models "HTTP/1.1 200 OK"
2023-11-16 19:55:42.872 Uncaught app exception
Traceback (most recent call last):
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 541, in _run_script
    exec(code, module.__dict__)
  File "/root/chatchat/Langchain-Chatchat/webui.py", line 64, in <module>
    pages[selected_page]["func"](api=api, is_lite=is_lite)
  File "/root/chatchat/Langchain-Chatchat/webui_pages/dialogue/dialogue.py", line 178, in dialogue_page
    chat_box.output_messages()
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit_chatbox/messages.py", line 337, in output_messages
    self.show_feedback(history_index=i, **feedback_kwargs)
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit_chatbox/messages.py", line 309, in show_feedback
    return streamlit_feedback(**kwargs)
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit_feedback/__init__.py", line 97, in streamlit_feedback
    component_value = _component_func(
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit/components/v1/components.py", line 80, in __call__
    return self.create_instance(*args, default=default, key=key, **kwargs)
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit/runtime/metrics_util.py", line 367, in wrapped_func
    result = non_optional_func(*args, **kwargs)
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit/components/v1/components.py", line 241, in create_instance
    return_value = marshall_component(dg, element)
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit/components/v1/components.py", line 212, in marshall_component
    component_state = register_widget(
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit_option_menu/streamlit_callback.py", line 20, in wrapper_register_widget
    return register_widget(*args, **kwargs)
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit/runtime/state/widgets.py", line 161, in register_widget
    return register_widget_from_metadata(metadata, ctx, widget_func_name, element_type)
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit/runtime/state/widgets.py", line 194, in register_widget_from_metadata
    raise DuplicateWidgetID(
streamlit.errors.DuplicateWidgetID: There are multiple widgets with the same key=''.

To fix this, please make sure that the key argument is unique for each widget you create.
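The DuplicateWidgetID error at the end is a Streamlit constraint rather than anything vLLM-specific: every widget rendered on a page needs a distinct key. A minimal illustration of the idea (a hypothetical example, not the project's actual feedback code):

```python
import streamlit as st

messages = ["answer 1", "answer 2", "answer 3"]

for i, msg in enumerate(messages):
    st.write(msg)
    # Without a per-message key, every button shares the same (empty) key and
    # Streamlit raises DuplicateWidgetID; a unique key per widget avoids it.
    st.button("👍", key=f"feedback_{i}")
```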

zRzRzRzRzRzRzR commented 9 months ago

Can't reproduce. On 0.2.8 I retested the latest Qwen model on a single machine with multiple GPUs and saw no problem. Make sure your framework dependencies are correct and that the Qwen model's config files are the latest version officially released by them.

kirinrin commented 4 months ago

I tested this myself: qwen + fastchat + vllm_worker and qwen + vllm run at the same speed. What really affects speed is the model itself; an int4 model is simply faster. Beyond that it comes down to parameters like --max-model-len 6000, --quantization gptq, --dtype float16, and so on.
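For context, those flags correspond to vLLM engine arguments. A rough sketch of the same settings through vLLM's Python API; the values are just the ones mentioned in this thread, and quantization="gptq" only works with a GPTQ-quantized checkpoint, so it is left commented out for the plain float16 Qwen-14B-Chat weights:

```python
from vllm import LLM, SamplingParams

# Sketch: the knobs mentioned above, expressed via vLLM's Python API.
llm = LLM(
    model="/root/.cache/modelscope/hub/qwen/Qwen-14B-Chat",
    trust_remote_code=True,
    tensor_parallel_size=4,   # shard across 4 GPUs
    max_model_len=6000,       # cap the context length to save KV-cache memory
    dtype="float16",
    # quantization="gptq",    # requires an int4/GPTQ checkpoint; that is what makes it fast
)

out = llm.generate(["hello"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```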