Your current environment

How would you like to use Aphrodite?

When running Qwen 2.5 (72B) AWQ, the engine forces the context length down:

- `max-model-len=100000` gets reduced to ~10,000 tokens
- with an FP8 KV cache, it only reaches ~20,000 tokens
- the exact same model weights work with the full context under vLLM on my 4x 3090s
- I get the same behavior with GPTQ
Error:
ERROR: The model's max seq len (100000) is larger than the maximum number of tokens that can
be stored in KV cache (10176). Try increasing `gpu_memory_utilization`, setting
`--enable-chunked-prefill`, or `--kv-cache-dtype fp8`
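For reference, the reported numbers line up with simple KV-cache arithmetic. A back-of-the-envelope sketch in Python, assuming the published Qwen2.5-72B config values (80 layers, 8 KV heads, head dim 128); the cache "budget" below is inferred from the error message, not read out of Aphrodite:

```python
# Back-of-the-envelope KV-cache capacity check for Qwen2.5-72B.
# Assumed config values (taken from the public Qwen2.5-72B model config):
NUM_LAYERS = 80   # num_hidden_layers
NUM_KV_HEADS = 8  # num_key_value_heads (GQA: 64 query heads share 8 KV heads)
HEAD_DIM = 128    # hidden_size 8192 / 64 attention heads

def kv_bytes_per_token(dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: K and V, in every layer."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * dtype_bytes

fp16 = kv_bytes_per_token(2)  # 327,680 bytes, ~320 KiB per token
fp8 = kv_bytes_per_token(1)   # half of that

# The error says 10,176 tokens fit at FP16, so the engine found roughly
# this much free VRAM for the cache after loading the weights:
budget = 10_176 * fp16
print(f"cache budget:  {budget / 2**30:.1f} GiB")          # ~3.1 GiB
print(f"FP16 capacity: {budget // fp16:,} tokens")         # 10,176
print(f"FP8 capacity:  {budget // fp8:,} tokens")          # 20,352, ~ the 20k observed
print(f"100k ctx needs {100_000 * fp16 / 2**30:.1f} GiB")  # ~30.5 GiB at FP16
```

FP8 reaching roughly twice the FP16 token count matches this arithmetic exactly, which suggests the clamp reflects the free VRAM left after weights rather than a bug in the length calculation itself.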
Question: How can we achieve full context length with these models in Aphrodite?
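The error message itself names the available knobs. A hypothetical sketch combining them via the Python entrypoint, assuming Aphrodite mirrors vLLM's `LLM` engine arguments (it is a vLLM fork); the model ID and values below are placeholders, not a known-working configuration:

```python
# Hypothetical invocation; assumes Aphrodite keeps vLLM-style engine arguments.
from aphrodite import LLM

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",  # placeholder model ID
    quantization="awq",
    tensor_parallel_size=4,        # e.g. the 4x 3090 box from the report
    max_model_len=100_000,
    gpu_memory_utilization=0.95,   # raise from the default to leave more VRAM for cache
    kv_cache_dtype="fp8",          # halves KV bytes/token, roughly doubles capacity
    enable_chunked_prefill=True,   # also suggested by the error message
)
```

Even with all three knobs, the arithmetic above says a 100k-token FP8 cache still needs ~15 GiB on top of the weights, so whether this restores the full context depends on how much VRAM is actually free after loading.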