Why do you want to set cache-max-entry-count to 0.01?
To save memory. The weights take up about 90% of the VRAM, so there isn't another 50% left over for the KV cache.
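To make that budget concrete, here is my own back-of-the-envelope arithmetic; the figures are estimates, and the exact serve invocation below is an assumption rather than the reporter's original command:

```
# 2x RTX 4090 = 2 x 24 GB = 48 GB total.
# 72B parameters at 4-bit (AWQ) ~= 72e9 * 0.5 bytes ~= 36 GB for the weights alone,
# plus runtime buffers -- roughly 90% of the 48 GB is gone before any KV cache.
# A ~50% reservation for the cache cannot fit, hence shrinking it to 1%:
lmdeploy serve api_server Qwen/Qwen2-72B-Instruct-AWQ \
  --model-format awq --tp 2 --cache-max-entry-count 0.01
```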
I found a solution in a video:

```
lmdeploy convert --model-format awq --group-size 128 --tp 2 qwen Qwen/Qwen2-72B-Instruct-AWQ
```
It's weird that lmdeploy convert is not documented.
What's more, nowhere is it mentioned that the model must be converted with --tp 2 in order to be served with --tp 2. Other LLM serving engines can serve a stock model directly when using --tp 2.
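As far as I can tell, the converted workspace bakes in the tensor-parallel layout, so each tp setting needs its own conversion. A minimal sketch, assuming separate output directories via --dst-path (the directory names are mine):

```
# Convert once per tensor-parallel degree; the workspace is tied to the tp used here.
lmdeploy convert --model-format awq --group-size 128 --tp 2 \
  --dst-path ./workspace-tp2 qwen Qwen/Qwen2-72B-Instruct-AWQ
# A single-GPU deployment would need its own conversion, e.g.:
# lmdeploy convert --model-format awq --group-size 128 --tp 1 \
#   --dst-path ./workspace-tp1 qwen Qwen/Qwen2-72B-Instruct-AWQ
```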
I tried this command to achieve maximum performance, and it is about 25% faster than vLLM.
```
lmdeploy serve api_server ./workspace --model-format awq --tp 2 --max-batch-size 32 --session-len 2048 --cache-max-entry-count 0.45 --quant-policy 8 --enable-prefix-caching
```
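To sanity-check the server, a request against the OpenAI-compatible endpoint should work (23333 is lmdeploy's default port; the model name "qwen" is my assumption, and the actual served name can be read from /v1/models):

```
# "qwen" is assumed here; query http://localhost:23333/v1/models for the real name.
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
```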
Using "--quant-policy 4" is about 30% faster than vLLM, but I haven't test its accuracy loss.
The disadvantage is the lack of an eager mode. To serve long contexts like 8192 tokens, I have to reduce "--cache-max-entry-count" to 0.2, which results in poorer performance (almost the same as vLLM). vLLM has an option named "--max-seq-len-to-capture", so when handling long contexts it can fall back to eager mode and avoid OOM.
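For comparison, this is roughly the vLLM setup being described; the flags are from vLLM's OpenAI-compatible server, though the memory fraction is my own choice:

```
# Sequences longer than --max-seq-len-to-capture skip CUDA graphs and run eagerly,
# so long prompts don't pay the extra memory reserved for graph capture.
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2-72B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --max-seq-len-to-capture 2048 \
  --gpu-memory-utilization 0.9
```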
My initial human evaluation shows that Q8 and Q0 (an unquantized KV cache) are almost identical in accuracy, while Q4 is slightly worse. However, when processing texts of around 1k tokens, Q4 is approximately 30% faster than Q8.
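For reference, --quant-policy selects the KV-cache precision (0 = no quantization, 4 = int4, 8 = int8). The two runs compared above differ only in that flag; a sketch using a shell variable of my own for the shared options:

```
COMMON="--model-format awq --tp 2 --max-batch-size 32 --session-len 2048 --cache-max-entry-count 0.45 --enable-prefix-caching"
# KV cache int8 (near-lossless in my evaluation):
lmdeploy serve api_server ./workspace $COMMON --quant-policy 8
# KV cache int4 (~30% faster on ~1k-token inputs, slightly lower accuracy):
lmdeploy serve api_server ./workspace $COMMON --quant-policy 4
```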
I will be working on improving the accuracy of KV Cache Int4 soon, please stay tuned.
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 5 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 5 days. Please open a new issue if you have similar issues or you have any new updates now.
Checklist
Describe the bug
I want to deploy the Qwen/Qwen2-72B-Instruct-AWQ model on two 4090s. On Ollama it runs in about 40 GB of VRAM, on vLLM it needs 44+ GB, but with lmdeploy there seems to be no way to make it fit.
Reproduction
I have already set all the parameters to their minimum values:
Environment
Error traceback