SachaHu opened this issue 1 week ago
Tested multi-GPU inference with internlm2_5-7b-chat-4bit:

```bash
lmdeploy serve api_server /data/models/internlm-7b-chat-int4 --backend turbomind --model-format awq --chat-template internlm2 --tp 2
```

The inference service runs normally, so multi-GPU inference itself can probably be ruled out as the cause.
Set `export TM_DEBUG_LEVEL=DEBUG` and add the option `--log-level=DEBUG` when starting the server, then check where the error appears in the log.
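A minimal sketch of such a relaunch with both debug switches enabled (the model ID and flags are copied from the reproduction steps below; piping through `tee` to keep a copy of the log is optional):

```bash
# Enable TurboMind's internal debug logging and lmdeploy's DEBUG log level.
export TM_DEBUG_LEVEL=DEBUG
lmdeploy serve api_server Shanghai_AI_Laboratory/internlm2_5-7b-chat \
    --backend turbomind \
    --chat-template internlm2 \
    --tp 2 \
    --log-level=DEBUG 2>&1 | tee server.log   # keep a copy for the issue
```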
@lvhan028 Here is the tail of the log; the error seems to be `[TM][DEBUG] getPtr with type i4, but data type is: x`:
```
[TM][DEBUG] T turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] T turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: decoder_output
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: last_token_hidden_units
[TM][DEBUG] T turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] T turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: last_token_hidden_units
[TM][DEBUG] T turbomind::Tensor::getPtr() const [with T = __nv_bfloat16] start
[TM][DEBUG] run syncAndCheck at /lmdeploy/src/turbomind/models/llama/unified_decoder.cc:148
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_query
[TM][DEBUG] run syncAndCheck at /lmdeploy/src/turbomind/models/llama/unified_decoder.cc:148
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: layer_id
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: input_query
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: cu_q_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: layer_id
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: cu_k_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: cu_q_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_cu_q_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: cu_k_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_cu_k_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_cu_q_len
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: hidden_features
[TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: h_cu_k_len
[TM][DEBUG] void turbomind::UnifiedAttentionLayer
```
Describe the bug
On a server with 4 T4 GPUs, deploying internlm2_5-7b-chat with lmdeploy and tensor parallelism (tp=2): the model loads into GPU memory successfully and the API server starts normally. But as soon as the inference endpoint is called, the model aborts and the process exits.

With internlm/internlm2_5-7b-chat-4bit, deployment on a single GPU works fine.
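For reference, a single-GPU launch of the 4-bit model along the lines of the AWQ command quoted earlier would look roughly like this (the model ID below is the published HF/ModelScope ID; a local path works too):

```bash
# Single-GPU deployment of the AWQ 4-bit model, which works per the report
# above; no --tp flag, so only one GPU is used.
lmdeploy serve api_server internlm/internlm2_5-7b-chat-4bit \
    --backend turbomind \
    --model-format awq \
    --chat-template internlm2
```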
Reproduction
Using ModelScope:

```bash
export LMDEPLOY_USE_MODELSCOPE=True
```

Deploy the service with the CLI tool:

```bash
lmdeploy serve api_server Shanghai_AI_Laboratory/internlm2_5-7b-chat --backend turbomind --chat-template internlm2 --tp 2
```
From another machine, send a POST request to the inference endpoint `ip:23333/v1/chat/completions` (the user message asks for a story from the Three Kingdoms) with the body:

```json
{
  "model": "/root/.cache/modelscope/hub/Shanghai_AI_Laboratory/internlm2_5-7b-chat",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "讲一个三国故事"}
  ],
  "temperature": 0.7,
  "top_p": 0.8
}
The process then errors out and exits:

```
(lmdeploy) [root@local-gpu models]# lmdeploy serve api_server Shanghai_AI_Laboratory/internlm2_5-7b-chat --backend turbomind --chat-template internlm2 --tp 2
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Started server process [32752]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
已放弃
```

The final line, 已放弃, is the shell's locale message for "Aborted".