serser opened this issue 5 months ago
Btw, Qwen has provided the AWQ quant version: https://huggingface.co/Qwen/Qwen2-72B-Instruct-AWQ
Cool, is it directly loadable from LMDeploy?
An update: the quantized model can be loaded with `lmdeploy chat ./Qwen2-72B-Instruct-Quant`, and it is automatically converted to the TurboMind format. The model chats fluently when I say hi.
When I load /Qwen2-72B-Instruct-awq, I encounter KeyError: 'model.layers.0.mlp.gate_proj.scales'. Have you solved it?
Add `--model-format awq`, please.
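For reference, a minimal Python sketch of the same fix through LMDeploy's pipeline API; the checkpoint path below is just a placeholder, and `model_format='awq'` is the programmatic counterpart of the `--model-format awq` flag:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Placeholder path: point this at your own AWQ checkpoint directory.
backend_config = TurbomindEngineConfig(
    model_format='awq',  # tell TurboMind the weights carry AWQ scales/zeros
    tp=1,
)
pipe = pipeline('./Qwen2-72B-Instruct-AWQ', backend_config=backend_config)
print(pipe(['hi']))
```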
This is my command:

```
lmdeploy serve api_server /yzwl_data/yumu/lmdeploy/lmdeploy/lite/apis/qwen2-7b-w8 --log-level INFO --backend turbomind --model-format awq --model-name qwen --server-port 23334 --tp 2 --max-batch-size 4 --quant-policy 8
```

and this is the error backtrace:

```
Traceback (most recent call last):
  File "/opt/conda/bin/lmdeploy", line 8, in
```
w8? Only w4a16 is supported by the TurboMind engine. If you want to run w8a8, please add `--backend pytorch` to use the PyTorch engine.
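A rough sketch of the PyTorch-engine route for a w8a8 checkpoint, reusing the model directory from the command above; this is only an illustration, not a verified recipe for that checkpoint:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# w8a8 models run on the PyTorch engine rather than TurboMind.
backend_config = PytorchEngineConfig(tp=2, max_batch_size=4)
pipe = pipeline('/yzwl_data/yumu/lmdeploy/lmdeploy/lite/apis/qwen2-7b-w8',
                backend_config=backend_config)
print(pipe(['hi']))
```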
Update again. Although tp=1 works for the quantized model, when I try to convert it with tp=2, it runs into the following error, since the first dimension of layers.0.feed_forward.w2.scales_zeros is an odd number, 231. From here I see that the dimension to split is fixed for the scales. Any help to circumvent this issue?
```
+-------+---------+---------------------------------------------------------------------------------------------------------------+
| Model | Version | Status                                                                                                        |
+-------+---------+---------------------------------------------------------------------------------------------------------------+
| qwen  | 1       | UNAVAILABLE: Internal: AssertionError: ('layers.0.feed_forward.w2.scales_zeros', 0, torch.Size([231, 8192]))  |
|       |         |                                                                                                               |
|       |         | At:                                                                                                           |
|       |         |   /home/me/qwen-2-72b-instruct/venv2/lib/python3.9/site-packages/lmdeploy/turbomind/deploy/target_model/base.py(247): save_split |
|       |         |   /home/me/qwen-2-72b-instruct/venv2/lib/python3.9/site-packages/lmdeploy/turbomind/deploy/target_model/base.py(274): export |
|       |         |   /home/me/qwen-2-72b-instruct/venv2/lib/python3.9/site-packages/lmdeploy/turbomind/turbomind.py(161): __init__ |
|       |         |   /home/me/qwen-2-72b-instruct/venv2/lib/python3.9/site-packages/lmdeploy/turbomind/turbomind.py(387): from_pretrained |
|       |         |   /home/me/qwen-2-72b-instruct/venv2/lib/python3.9/site-packages/lmdeploy/serve/async_engine.py(253): _build_turbomind |
+-------+---------+---------------------------------------------------------------------------------------------------------------+
```
Currently, lmdeploy only supports group_size 128.
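For what it's worth, the 231 seems to line up with the MLP width, assuming Qwen2-72B-Instruct uses intermediate_size = 29568 (my reading of the model config, consistent with the torch.Size([231, 8192]) in the error): 29568 / 128 = 231 groups, an odd count that cannot be split evenly across tp=2. A quick check:

```python
# Quick check of why splitting the w2 scales over tp=2 fails, assuming
# intermediate_size = 29568 for Qwen2-72B-Instruct and group_size = 128.
intermediate_size = 29568
group_size = 128
tp = 2

n_groups = intermediate_size // group_size
print(n_groups)        # 231 -> matches torch.Size([231, 8192]) in the error
print(n_groups % tp)   # 1   -> 231 groups cannot be divided evenly between 2 ranks
```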
Checklist
Describe the bug
When quantizing Qwen2-72B-Instruct, it fails with the assertion at this layer, model.layers.2.mlp.gate_proj. I tried --search-scale from issue #1656, which runs into the same error. It turns out that when calculating the weight scales, many groups of weights happen to be all zeros, which causes a division by zero. I changed the scale calculation to avoid it.
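Roughly the kind of guard I mean; this is a minimal sketch under my own naming, not the exact LMDeploy code:

```python
import torch

def group_weight_scales(weight: torch.Tensor, group_size: int = 128,
                        eps: float = 1e-5) -> torch.Tensor:
    """Per-group absmax scales with a small floor to avoid dividing by zero."""
    # Reshape (out_features, in_features) into groups of `group_size` weights.
    grouped = weight.reshape(-1, group_size)
    absmax = grouped.abs().amax(dim=1, keepdim=True)
    # All-zero groups would otherwise yield a zero scale and a division by zero later.
    return absmax.clamp(min=eps)
```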
A similar assertion happens when calculating the smoothed scales here; I changed that calculation as well.
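Again just a sketch of the idea, assuming the usual activation-over-weight smoothing form; the names and the alpha default are illustrative:

```python
import torch

def smooth_scales(act_absmax: torch.Tensor, weight_absmax: torch.Tensor,
                  alpha: float = 0.5, eps: float = 1e-5) -> torch.Tensor:
    """Smoothing scales with both operands clamped so all-zero channels stay finite."""
    scales = (act_absmax.clamp(min=eps).pow(alpha) /
              weight_absmax.clamp(min=eps).pow(1.0 - alpha))
    return scales.clamp(min=eps)
```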
Then the quantization went through without error. I haven't checked the accuracy of the model yet. Please correct me if I've made mistakes.
Reproduction
Environment
Error traceback
No response