Your current environment
I'm encountering an issue with the Llama 3.1 8B model while using the HPU Docker image: the maximum context length I can input is around 30k tokens, even though the model supports a 128k-token context window. I'm running on 8 Gaudi cards. Any insights or suggestions would be appreciated!
https://github.com/HabanaAI/vllm-fork/issues/257#issuecomment-2413548759
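For context, here is a minimal sketch of how the engine might be launched with an explicit maximum context length through the vLLM Python API. The model id, token limits, and the long-prompt smoke test are illustrative assumptions, not the exact setup used here:

```python
# Minimal sketch (assumption: the vLLM Python API is available inside the HPU image;
# the model id and limits below are placeholders, not the exact values used in this setup).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model id
    tensor_parallel_size=8,                    # spread across the 8 Gaudi cards
    max_model_len=131072,                      # request the full 128k context
)

# Long-prompt smoke test: the engine refuses the request (or fails at startup)
# when max_model_len exceeds what the KV cache can actually hold.
params = SamplingParams(max_tokens=32)
out = llm.generate(["hello " * 30000], params)
print(out[0].outputs[0].text)
```

In practice the usable context on HPU is bounded by the memory left for the KV cache after weights and warm-up buffers, so the engine may cap requests well below the value passed as max_model_len.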
Model Input Dumps
.
🐛 Describe the bug
.
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.