Closed by xiangqi1997 3 weeks ago
@xiangqi1997 Hi, do you use the same settings for pipeline and api_server? For instance, --cache-max-entry-count 0.1 sets the KV cache memory after the model is loaded, and 0.1 is too small.
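For comparison, here is a minimal sketch of a pipeline run that mirrors those server flags, assuming the standard lmdeploy Python API, where PytorchEngineConfig carries the equivalents of --tp, --session-len, and --cache-max-entry-count:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Sketch: mirror the api_server flags in the pipeline config so both runs
# use the same KV cache budget. cache_max_entry_count is the fraction of
# GPU memory given to the KV cache after the weights load; 0.1 is very small.
backend_config = PytorchEngineConfig(
    tp=2,
    session_len=4096,
    cache_max_entry_count=0.1,
)
pipe = pipeline(
    'THUDM/cogvlm2-llama3-chinese-chat-19B',  # same model as in Reproduction
    backend_config=backend_config,
)
```

If pipeline succeeds with this config but api_server hangs with the same values, the difference is more likely in how the server is being called than in the engine settings.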
Sorry, my mistake. The server works fine now 🙏
Checklist
Describe the bug
Following https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/multi_modal/cogvlm.md, inference with pipeline works and returns results. With api_server, the service starts, but requests hang when the API is called: the server log shows only GET requests and no POST, and GPU utilization is stuck at 100%. What could be the cause?
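For reference, a minimal smoke test of the call that hangs; this is a sketch that assumes the lmdeploy api_server defaults (port 23333, OpenAI-compatible /v1/chat/completions route) and a hypothetical model name that should be checked against what /v1/models actually reports:

```python
import requests

# Assumed defaults: api_server listens on port 23333 and exposes an
# OpenAI-compatible route; adjust host/port if the server was started
# with --server-name / --server-port.
resp = requests.post(
    'http://127.0.0.1:23333/v1/chat/completions',
    json={
        'model': 'cogvlm2-llama3-chinese-chat-19B',  # see /v1/models for the exact name
        'messages': [{'role': 'user', 'content': 'Hello'}],
    },
    timeout=60,
)
print(resp.status_code, resp.json())
```

If this POST never appears in the server log while GET requests do, the request is not reaching the completion endpoint at all, which points at the client or URL rather than the model.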
Reproduction
lmdeploy serve api_server ~/.cache/huggingface/hub/models--THUDM--cogvlm2-llama3-chinese-chat-19B/snapshots/d88b352bce5ee58a289b1ac8328553eb31efa2ef/ --backend pytorch --tp 2 --cache-max-entry-count 0.1 --session-len 4096
Environment
Error traceback
No response