josephrocca opened this issue 2 months ago
Have you set TM_DEBUG_LEVEL=DEBUG (while keeping --log-level DEBUG)? It inserts a sync op between CUDA functions, which is helpful for finding the root cause.
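For example, you can prepend it to the serve command (an illustrative invocation only; the model path and flags are taken from the reproduction command quoted later in this thread, trimmed for brevity):

TM_DEBUG_LEVEL=DEBUG lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --tp 2 --model-format awq --server-port 3000 --log-level DEBUG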
Hi @irexyc, thanks for your speedy response. I just tried it and "unfortunately" it fixes the issue :sweat_smile: That is, inference works fine on v0.6.0 with 2x4090 if the TM_DEBUG_LEVEL=DEBUG environment variable is set.
Is there anything else you'd like me to try?
I'm seeing the same problem: the model only runs if TM_DEBUG_LEVEL=DEBUG is set. Otherwise, with multiple GPUs, one card gets stuck at 100% utilization and locks up.
@josephrocca @fanghostt
Can you reproduce it with other models? I can't reproduce it with Qwen2-7B-AWQ or Llama3-70B-AWQ with v0.6.0 on 2 RTX 4090 GPUs.
The same problem shows up on 2x A100 GPUs with the Qwen2-72B-Instruct-GPTQ-Int4 and InternVL2-40B-AWQ models on lmdeploy v0.6.0. Our environment may be a little different: we use Orion vGPU rather than physical machines. Any advice on how to solve this problem?
> I can't reproduce it with Qwen2-7B-AWQ or Llama3-70B-AWQ with v0.6.0 on 2 RTX 4090 GPUs
@lzhangzz Note that this bug report is about Llama 2 70B. Can you try with Llama 2 70B AWQ instead of Llama 3? Here's my command again from the original post for convenience:
lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --tp 2 --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4 --log-level DEBUG
@josephrocca Is it only reproducible using llama2-chat-70b-4bit?
@josephrocca
Sorry for the confusion. Internet access is quite limited on our 4090 environment, so I started with what I already had on the machine.
@lvhan028 I have tested multiple Llama 2 70B AWQ models (not just lmdeploy/llama2-chat-70b-4bit), across multiple GPU types. Unfortunately I haven't tested Llama 3 70B.
(I did try testing Llama 3 70B on 2x4090 just now, but for some reason hit a separate problem with an explicit OOM error - likely an unrelated issue that I just need to spend time to debug. I will look into it tomorrow and open a separate issue if needed, but it's probably something wrong on my end.)
@josephrocca In my test with Llama3 70B AWQ on 2x4090, --cache-max-entry-count 0.5 is needed to avoid OOM.
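For reference, the shape of the command was roughly as follows (a sketch only; the model path is a placeholder, not the exact checkpoint I used):

lmdeploy serve api_server <path-to-llama3-70b-awq> --model-format awq --tp 2 --cache-max-entry-count 0.5 --server-port 3000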
Checklist
Describe the bug
Llama2 70B works fine on a dual RTX 4090 machine in v0.5.3, but fails in v0.6.0a0 and v0.6.0. There is no error message given, even with --log-level DEBUG.

Reproduction

I'm testing on Runpod, using the official Docker images from here: https://hub.docker.com/r/openmmlab/lmdeploy/tags

lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --tp 2 --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4 --log-level DEBUG

I tried both the cu11 and cu12 tags, and there was no difference in behavior. I also tried without --enable-prefix-caching and --quant-policy 4, but this did not fix it.

Environment
Error traceback