kunger97 opened this issue 1 month ago
@afierka-intel can you check this out? I remember you've experienced a similar bug in the weight-loading phase of large models (Llama-405B or Mixtral-8x7B) on HPU. It should be simple to check whether it's caused by the same bug.
The change in https://github.com/HabanaAI/vllm-fork/pull/55/ could fix this bug; the same change is needed for Qwen at least.
Like this:
```diff
diff --git a/vllm/model_executor/models/qwen2.py b/vllm/model_executor/models/qwen2.py
index f38be0e9..c42b67d4 100644
--- a/vllm/model_executor/models/qwen2.py
+++ b/vllm/model_executor/models/qwen2.py
@@ -371,3 +371,6 @@ class Qwen2ForCausalLM(nn.Module):
                 weight_loader = getattr(param, "weight_loader", default_weight_loader)
                 weight_loader(param, loaded_weight)
+
```
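The `+` lines in the pasted diff are truncated, so the exact addition isn't visible here. A rough sketch of what the analogous change in `Qwen2ForCausalLM.load_weights` might look like follows; the synchronize call and its guard are my assumptions, modeled on the weight-loading fix for Llama/Mixtral (flushing HPU lazy-mode execution after each weight load so queued ops don't accumulate and exhaust memory while loading a large checkpoint):

```python
# Hypothetical sketch only: the added lines in the diff above are truncated.
import torch

# Import path may differ across vLLM versions.
from vllm.model_executor.model_loader.weight_utils import default_weight_loader


def load_weights(self, weights):
    """Tail of Qwen2ForCausalLM.load_weights (stacked-param remapping elided)."""
    params_dict = dict(self.named_parameters(remove_duplicate=False))
    for name, loaded_weight in weights:
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)
        # Assumed fix: synchronize the HPU after each tensor is loaded.
        # Guarded with hasattr so the same code still runs on non-HPU builds
        # (torch.hpu only exists once habana_frameworks.torch is imported).
        if hasattr(torch, "hpu") and torch.hpu.is_available():
            torch.hpu.synchronize()
```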
### Your current environment

### How would you like to use vllm
I'm trying to run `vllm serve` on Gaudi2 on Intel Devcloud. I have installed vllm-fork and I'm using the command below, and it seems to show an HPU OOM:
```bash
PT_HPU_LAZY_MODE=1 vllm serve Qwen/Qwen1.5-32B-Chat --dtype bfloat16 --block-size 128 --device hpu
```
I also tried Qwen 13B, and it works normally. In addition, when I use optimum-habana to perform inference, it can generate text normally, roughly as in the sketch below.
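For reference, the optimum-habana path that generates text normally looks roughly like this. This is a minimal sketch: the prompt and generation arguments are illustrative, and only the model name matches the serve command above.

```python
# Minimal sketch of plain optimum-habana inference on HPU.
# Prompt and generation arguments are illustrative, not exact.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch transformers with Gaudi-optimized paths

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-32B-Chat")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-32B-Chat", torch_dtype=torch.bfloat16
).to("hpu")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```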