erjiguan closed this issue 9 months ago.
Try setting the model parameter tensor-parallel-size to 4 in server_config?
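For reference, a minimal sketch of where that setting lives in configs/server_config.py, assuming the key names used in Langchain-Chatchat 0.2.x's server_config.py.example (infer_turbo, tensor_parallel_size); check them against your own copy:

```python
# configs/server_config.py -- sketch only; key names assumed from
# server_config.py.example in Langchain-Chatchat 0.2.x.
FSCHAT_MODEL_WORKERS = {
    "Qwen-14B-Chat": {
        "device": "cuda",
        "infer_turbo": "vllm",       # run this model through the vLLM worker
        "tensor_parallel_size": 4,   # shard the weights across 4 GPUs
    },
}
```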
@erjiguan May I ask how you are passing parameters to the API? Following the example for port 7861, Postman keeps telling me the request format is wrong, and running it directly from the command line returns nothing; the logs directory shows no output either.

```bash
curl -X 'POST' \
  'http://127.0.0.1:7861/chat/fastchat' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "chatglm2-6b",
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "temperature": 0.7,
    "n": 1,
    "max_tokens": 0,
    "stop": [],
    "stream": false,
    "presence_penalty": 0,
    "frequency_penalty": 0
  }'
```
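For anyone hitting the same format error, here is the identical payload sent from Python; the URL and fields are copied from the curl call above, and whether /chat/fastchat accepts this schema depends on your Langchain-Chatchat version:

```python
# Same request as the curl call above, sent as raw JSON from Python.
import requests

payload = {
    "model": "chatglm2-6b",
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 0.7,
    "n": 1,
    "max_tokens": 0,
    "stop": [],
    "stream": False,
    "presence_penalty": 0,
    "frequency_penalty": 0,
}
resp = requests.post("http://127.0.0.1:7861/chat/fastchat", json=payload, timeout=300)
print(resp.status_code, resp.text)
```

If Postman reports a format error with this same body, check that it is sending the raw JSON with Content-Type: application/json rather than form data.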
> Try setting the model parameter tensor-parallel-size to 4 in server_config?
That didn't help; the speed hasn't changed, it is still very slow.
> @erjiguan May I ask how you are passing parameters to the API? Following the example for port 7861, Postman keeps telling me the request format is wrong…
I'm still testing performance and output quality; I haven't tried the API yet.
> Try setting the model parameter tensor-parallel-size to 4 in server_config?
It turned out the vLLM inference acceleration framework wasn't enabled. After enabling it I hit a few more problems: first, startup.py doesn't import worker_id; after I added the import myself, asking a question raises the following error:
```
{'base_url': 'http://127.0.0.1:7864', 'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://127.0.0.1': None, 'http://': None, 'https://': None, 'all://': None}}
{'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://127.0.0.1': None, 'http://': None, 'https://': None, 'all://': None}}
2023-11-16 19:55:42,826 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:20001/list_models "HTTP/1.1 200 OK"
INFO: 127.0.0.1:58920 - "POST /llm_model/list_running_models HTTP/1.1" 200 OK
2023-11-16 19:55:42,828 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:7864/llm_model/list_running_models "HTTP/1.1 200 OK"
INFO: 127.0.0.1:58920 - "POST /llm_model/list_config_models HTTP/1.1" 200 OK
2023-11-16 19:55:42,830 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:7864/llm_model/list_config_models "HTTP/1.1 200 OK"
{'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://127.0.0.1': None, 'http://': None, 'https://': None, 'all://': None}}
2023-11-16 19:55:42,864 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:20001/list_models "HTTP/1.1 200 OK"
INFO: 127.0.0.1:58920 - "POST /llm_model/list_running_models HTTP/1.1" 200 OK
2023-11-16 19:55:42,866 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:7864/llm_model/list_running_models "HTTP/1.1 200 OK"
2023-11-16 19:55:42.872 Uncaught app exception
Traceback (most recent call last):
  File "/root/miniconda3/envs/chatchat_env/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 541, in _run_script
    exec(code, module.__dict__)
  File "/root/chatchat/Langchain-Chatchat/webui.py", line 64, in <module>
streamlit.errors.DuplicateWidgetID: There are multiple widgets with the same key=''.
To fix this, please make sure that the `key` argument is unique for each
widget you create.
```
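As for the missing worker_id mentioned above, a minimal workaround sketch, assuming worker_id is only needed as a unique label for the worker (FastChat's own model_worker generates it the same way):

```python
# Hypothetical patch for startup.py: define worker_id locally instead of
# importing it, mirroring how fastchat.serve.model_worker creates it.
import uuid

worker_id = str(uuid.uuid4())[:8]  # short random id identifying this worker
```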
I can't reproduce this. On 0.2.8 I retested single-machine multi-GPU with the latest Qwen model and hit no problems. Make sure your framework dependencies are correct and that the Qwen model config files are the latest version officially released by the Qwen team.
I tested this myself: qwen + fastchat + vllm_worker and qwen + vllm run at the same speed. What really affects speed is the model itself; an int4 quantized model is simply faster. Beyond that, it comes down to parameters such as --max-model-len 6000, --quantization gptq, and --dtype float16.
Problem Description
Chatting with Qwen-14B-Chat is very slow, much slower than the API I serve directly with FastChat, by rough observation more than 10x slower. I have configured all 4 GPUs to be used, and flash-attention is not installed in either setup. With FastChat, I launch it like this:

```bash
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.vllm_worker --model-path /root/.cache/modelscope/hub/qwen/Qwen-14B-Chat --trust-remote-code --tensor-parallel-size 4
```
Expected Result
The speed should match chatting through the API launched directly with FastChat.
Actual Result
Slow.
Environment Information