erjiguan closed this issue 9 months ago.
Try setting the model parameter tensor-parallel-size to 4 in server_config?
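For reference, a minimal sketch of where that setting lives in configs/server_config.py, assuming the key names used in Langchain-Chatchat 0.2.x's server_config.py.example (infer_turbo, tensor_parallel_size); check them against your own copy:

```python
# configs/server_config.py -- sketch only; key names assumed from
# server_config.py.example in Langchain-Chatchat 0.2.x.
FSCHAT_MODEL_WORKERS = {
    "Qwen-14B-Chat": {
        "device": "cuda",
        "infer_turbo": "vllm",       # run this model through the vLLM worker
        "tensor_parallel_size": 4,   # shard the weights across 4 GPUs
    },
}
```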
@erjiguan May I ask how you are passing parameters to the API? Following the example for port 7861, Postman keeps telling me the request format is wrong, and running it directly from the command line returns nothing; the logs directory shows no output either.

```bash
curl -X 'POST' \
  'http://127.0.0.1:7861/chat/fastchat' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "chatglm2-6b",
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "temperature": 0.7,
    "n": 1,
    "max_tokens": 0,
    "stop": [],
    "stream": false,
    "presence_penalty": 0,
    "frequency_penalty": 0
  }'
```
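For anyone hitting the same format error, here is the identical payload sent from Python; the URL and fields are copied from the curl call above, and whether /chat/fastchat accepts this schema depends on your Langchain-Chatchat version:

```python
# Same request as the curl call above, sent as raw JSON from Python.
import requests

payload = {
    "model": "chatglm2-6b",
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 0.7,
    "n": 1,
    "max_tokens": 0,
    "stop": [],
    "stream": False,
    "presence_penalty": 0,
    "frequency_penalty": 0,
}
resp = requests.post("http://127.0.0.1:7861/chat/fastchat", json=payload, timeout=300)
print(resp.status_code, resp.text)
```

If Postman reports a format error with this same body, check that it is sending the raw JSON with Content-Type: application/json rather than form data.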
> Try setting the model parameter tensor-parallel-size to 4 in server_config?
That didn't help; the speed hasn't changed, it is still very slow.
> @erjiguan May I ask how you are passing parameters to the API? Following the example for port 7861, Postman keeps telling me the request format is wrong…
I'm still testing performance and output quality; I haven't tried the API yet.
> Try setting the model parameter tensor-parallel-size to 4 in server_config?
It turned out the vLLM inference acceleration framework wasn't enabled. After enabling it I hit a few more problems: first, startup.py doesn't import worker_id; after I added the import myself, asking a question raises the following error:
```
{'base_url': 'http://127.0.0.1:7864', 'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://127.0.0.1': None, 'http://': None, 'https://': None, 'all://': None}}
{'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://127.0.0.1': None, 'http://': None, 'https://': None, 'all://': None}}
2023-11-16 19:55:42,826 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:20001/list_models "HTTP/1.1 200 OK"
INFO: 127.0.0.1:58920 - "POST /llm_model/list_running_models HTTP/1.1" 200 OK
2023-11-16 19:55:42,828 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:7864/llm_model/list_running_models "HTTP/1.1 200 OK"
INFO: 127.0.0.1:58920 - "POST /llm_model/list_config_models HTTP/1.1" 200 OK
2023-11-16 19:55:42,830 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:7864/llm_model/list_config_models "HTTP/1.1 200 OK"
{'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://127.0.0.1': None, 'http://': None, 'https://': None, 'all://': None}}
2023-11-16 19:55:42,864 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:20001/list_models "HTTP/1.1 200 OK"
INFO: 127.0.0.1:58920 - "POST /llm_model/list_running_models HTTP/1.1" 200 OK
2023-11-16 19:55:42,866 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:7864/llm_model/list_running_models "HTTP/1.1 200 OK"
2023-11-16 19:55:42.872 Uncaught app exception
Traceback (most recent call last):
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 541, in _run_script
    exec(code, module.__dict__)
  File "/root/chatchat/Langchain-Chatchat/webui.py", line 64, in <module>
streamlit.errors.DuplicateWidgetID: There are multiple widgets with the same key=''.
To fix this, please make sure that the `key` argument is unique for each
widget you create.
```
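As for the missing worker_id mentioned above, a minimal workaround sketch, assuming worker_id is only needed as a unique label for the worker (FastChat's own model_worker generates it the same way):

```python
# Hypothetical patch for startup.py: define worker_id locally instead of
# importing it, mirroring how fastchat.serve.model_worker creates it.
import uuid

worker_id = str(uuid.uuid4())[:8]  # short random id identifying this worker
```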
I can't reproduce this. On 0.2.8 I retested single-machine multi-GPU with the latest Qwen model and hit no problems. Make sure your framework dependencies are correct and that the Qwen model config files are the latest version officially released by the Qwen team.
I tested this myself: qwen + fastchat + vllm_worker and qwen + vllm run at the same speed. What really affects speed is the model itself; an int4 quantized model is simply faster. Beyond that, it comes down to parameters such as --max-model-len 6000, --quantization gptq, and --dtype float16.
Problem Description
Chatting with Qwen-14B-Chat is very slow, much slower than the API I serve directly with FastChat, by rough observation more than 10x slower. I have configured all 4 GPUs to be used, and flash-attention is not installed in either setup. With FastChat, I launch it like this:

```bash
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.vllm_worker --model-path /root/.cache/modelscope/hub/qwen/Qwen-14B-Chat --trust-remote-code --tensor-parallel-size 4
```
Expected Result
The speed should match chatting through the API launched directly with FastChat.
Actual Result
Slow.
Environment Information