Open LiYtao opened 2 weeks ago
Maybe related to #2706
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
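I hit the same problem: with 50 instances running at TP=2, a few of them hang. On a hung instance, one GPU sits at 0% utilization while the other is at 100%.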
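For anyone trying to spot hung instances automatically, here is a minimal sketch that flags the pattern described above (one GPU in a TP group at 0% while another is at 100%). It is not part of lmdeploy. The helper names `parse_utilization`, `looks_stuck`, and `query_gpus` are hypothetical, and the sketch assumes `nvidia-smi` is on the PATH and reports numeric utilization values. A single sample can be a false positive, so it is probably worth checking several samples a few seconds apart before restarting anything.

```python
import subprocess


def parse_utilization(csv_text):
    """Parse lines of the form 'index, utilization' into {gpu_index: percent}."""
    util = {}
    for line in csv_text.strip().splitlines():
        index, percent = (field.strip() for field in line.split(","))
        util[int(index)] = int(percent)
    return util


def looks_stuck(util, group):
    """Return True if, within one TP group, one GPU is idle (0%) while another is pegged (100%)."""
    values = [util[i] for i in group]
    return min(values) == 0 and max(values) == 100


def query_gpus():
    """Take one utilization sample for all visible GPUs via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(result.stdout)
```

For a TP=2 instance on GPUs 0 and 1, `looks_stuck(query_gpus(), (0, 1))` returns `True` when the sample matches the symptom. The mapping from instance to GPU indices is something you have to supply yourself.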
Checklist
Describe the bug
When deploying InternVL2-40B with api_server, the server occasionally hangs. I am using 4 A100 GPUs. The last line of the debug log is:
[TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
Reproduction
CUDA_VISIBLE_DEVICES=0,3,4,7 NCCL_P2P_DIRECT_DISABLE=1 lmdeploy serve api_server /var/aigc/model/InternVL2-40B --tp 4 --server-port 23333 --log-level DEBUG
Environment
Error traceback