Checklist
[x] 1. I have searched related issues but cannot get the expected help.
[x] 2. The bug has not been fixed in the latest version.
[x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I quantized the model with SmoothQuant, but the inference speed has not increased. Why?
lmdeploy lite smooth_quant /model/llama2-7b-hf/ --work-dir /model/lmdeploy/llama2-7b-w8/
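As a sanity check (not part of the original report), one could confirm that the exported checkpoint actually stores int8 weights. A minimal sketch, assuming the work-dir holds standard PyTorch .bin shards (it may use safetensors instead, in which case safetensors.torch.load_file applies); the path is the one from the command above:

```python
# Hedged sketch: inspect the dtypes stored in the quantized checkpoint.
import glob
from collections import Counter

import torch

dtypes = Counter()
for shard in glob.glob('/model/lmdeploy/llama2-7b-w8/*.bin'):
    state_dict = torch.load(shard, map_location='cpu')
    for name, tensor in state_dict.items():
        dtypes[tensor.dtype] += 1

# Expect mostly torch.int8 for the linear weights if quantization took effect.
print(dtypes)
```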
After smooth_quant, the model files shrank from 12.56 GB to 6.55 GB, but GPU memory usage during inference dropped by only about 1 GB, and the inference speed improvement is also negligible.
python profile_throughput.py /dataset/ShareGPT_V3_unfiltered_cleaned_split.json /model/llama2-7b-hf/ --backend pytorch
gives 2850.681 token/s, but
python profile_throughput.py /dataset/ShareGPT_V3_unfiltered_cleaned_split.json /model/lmdeploy/llama2-7b-w8/ --backend pytorch
gives 2896.486 token/s.
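To rule out profiler overhead, end-to-end latency could also be compared directly through the engine. A minimal sketch, assuming lmdeploy's pipeline API with the PyTorch backend; the prompt is illustrative only:

```python
# Hedged sketch: compare single-request latency of the FP16 and W8A8 models
# outside profile_throughput.py, using lmdeploy's pipeline API.
import time

from lmdeploy import pipeline, PytorchEngineConfig

for model_path in ('/model/llama2-7b-hf/', '/model/lmdeploy/llama2-7b-w8/'):
    # In practice, run each model in a separate process so GPU memory is freed.
    pipe = pipeline(model_path, backend_config=PytorchEngineConfig())
    start = time.time()
    response = pipe(['Explain smooth quantization in one paragraph.'])
    print(model_path, f'{time.time() - start:.2f}s', response)
```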
Reproduction
none
Environment
Error traceback
No response