HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: unable to inference large context length #483

Open pranjalst opened 2 weeks ago

pranjalst commented 2 weeks ago

Your current environment

I'm encountering an issue with the LLaMA 3.1 8B model while using the HPU Docker image. The maximum context length I can input is around 30k tokens, even though the model supports a 128k-token context window. I'm using 8 Gaudi cards for this setup. Any insights or suggestions would be appreciated! https://github.com/HabanaAI/vllm-fork/issues/257#issuecomment-2413548759

[screenshots attached]
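For reference, a minimal sketch of how a larger context window would be requested through vLLM's offline API is shown below. The model ID, max_model_len, and gpu_memory_utilization values are assumptions, not taken from the report, and on the Gaudi/HPU backend they may still run into the memory limits this issue describes.

```python
# Sketch only: asking vLLM for the full Llama 3.1 context on 8 cards.
# The model ID and the numeric settings below are assumptions and may
# need tuning; they do not guarantee the HPU backend can allocate the
# KV cache for 128k tokens.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed model ID
    tensor_parallel_size=8,        # 8 Gaudi cards, as in the report
    max_model_len=131072,          # request the full 128k context
    gpu_memory_utilization=0.9,    # leave some headroom on each device
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```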


michalkuligowski commented 2 weeks ago

@pranjalst What versions of vLLM and SynapseAI are you using? Please show the output of collect_env.py.

pranjalst commented 2 weeks ago

[screenshots: environment details and the Docker image being used]

michalkuligowski commented 2 weeks ago

Can you try v0.5.3.post1+Gaudi-1.18.0?
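Before retesting, it can help to confirm which build is actually installed. This is a generic version check, not specific to the Gaudi fork; the exact version string it prints is an assumption.

```python
# Quick check of the installed vLLM build; the Gaudi fork is expected to
# report its own version string (e.g. something like "0.5.3.post1+Gaudi-1.18.0").
import vllm

print(vllm.__version__)
```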

pranjalst commented 2 weeks ago

Yes, I tried it. It's not working; it's giving me an error.