Checklist
[x] 1. I have searched related issues but cannot get the expected help.
[x] 2. The bug has not been fixed in the latest version.
[x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
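When running inference on the internvl-26b model with lmdeploy on NVIDIA L20 GPUs, the first-token latency exceeds 2 s. After profiling each stage, I found that most of the latency comes from the GPU-to-CPU copy stage, which is located in lmdeploy/vl/engine.py.
The code segment with the large latency is:
After the service starts, this transfer takes more than 1 s, which severely slows down the first-token latency. Would it be possible to optimize this part and reduce the impact of the transfer?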
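One caveat when reading this kind of per-stage timing: CUDA kernels are launched asynchronously, so a timer placed around a `.cpu()` call may also absorb the time of GPU work that was queued earlier (for example the vision model's forward pass) and has not finished yet. The sketch below is one way to separate the two, assuming PyTorch with a CUDA device. `timed` and `profile_forward_and_copy` are hypothetical helpers written for illustration, not part of lmdeploy, and `model` and `inputs` stand in for whatever the engine actually runs.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results, sync=None):
    """Store the wall-clock duration of the enclosed block in results[label].

    `sync` is an optional zero-argument callable (for example
    torch.cuda.synchronize). It runs before the timer starts and again
    before it stops, so queued asynchronous work is charged to the stage
    that launched it.
    """
    if sync is not None:
        sync()  # drain work queued before this stage
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()  # wait for work launched inside this stage
        results[label] = time.perf_counter() - start


def profile_forward_and_copy(model, inputs):
    """Hypothetical helper: time the forward pass and the GPU-to-CPU copy separately.

    Assumes `model(inputs)` returns a single CUDA tensor.
    """
    import torch  # assumption: PyTorch with a CUDA device is available

    results = {}
    with torch.no_grad():
        with timed("forward", results, sync=torch.cuda.synchronize):
            outputs = model(inputs)
        with timed("copy_to_cpu", results, sync=torch.cuda.synchronize):
            outputs_cpu = outputs.cpu()
    return results, outputs_cpu
```

If `copy_to_cpu` stays above 1 s with the synchronization in place, the transfer itself is the bottleneck. If most of the time moves to `forward`, the copy was mainly waiting for the preceding computation to finish.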
Reproduction
Launch command: lmdeploy serve api_server /multimodal/model-zoo/InternVL2-26B --backend turbomind --server-port 23333 --chat-template /multimodal/model-zoo/chat_template/chat_template.json --tp 4 --log-level DEBUG
Environment
Error traceback
No response