InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] No inference performance improvement after smooth-quantizing llama2-7b #2331

Open. zxy1119 opened this issue 3 months ago.

zxy1119 commented 3 months ago


Describe the bug

I quantized the model with smooth quant; why hasn't the inference speed increased?

lmdeploy lite smooth_quant /model/llama2-7b-hf/ --work-dir /model/lmdeploy/llama2-7b-w8/

After smooth quantization, the model files shrank from 12.56 GB to 6.55 GB, but GPU memory usage during inference dropped by only about 1 GB, and the inference speed barely improved:

python profile_throughput.py /dataset/ShareGPT_V3_unfiltered_cleaned_split.json /model/llama2-7b-hf/ --backend pytorch
result: 2850.681 token/s

python profile_throughput.py /dataset/ShareGPT_V3_unfiltered_cleaned_split.json /model/lmdeploy/llama2-7b-w8/ --backend pytorch
result: 2896.486 token/s
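(For context on the sizes reported above: 12.56 GB is consistent with llama2-7b's ~6.7B parameters at 2 bytes each in fp16, and int8 weights halve that to roughly 6.3 GB, close to the observed 6.55 GB. Runtime memory falls by much less because activations and the KV cache remain fp16.)

A minimal sketch of the same comparison via lmdeploy's Python API. The `pipeline` and `PytorchEngineConfig` entry points are from the lmdeploy docs; the `generate_token_len` response field, the prompt set, and running both models in one process are assumptions here, and in practice each model would be benchmarked in a fresh process:

```python
# Hedged sketch: compare PyTorch-backend throughput of the fp16 model
# and the smooth-quantized w8a8 model. Paths are the ones reported above.
import time

from lmdeploy import pipeline, PytorchEngineConfig

prompts = ["Summarize the plot of Hamlet."] * 64  # assumed toy workload

for model_path in ("/model/llama2-7b-hf/", "/model/lmdeploy/llama2-7b-w8/"):
    pipe = pipeline(model_path, backend_config=PytorchEngineConfig())
    t0 = time.perf_counter()
    responses = pipe(prompts)
    elapsed = time.perf_counter() - t0
    # generate_token_len: per-response output length (assumed field name)
    out_tokens = sum(r.generate_token_len for r in responses)
    print(f"{model_path}: {out_tokens / elapsed:.1f} token/s")
    del pipe  # free GPU memory before the next model (a fresh process is safer)
```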

Reproduction

none

Environment

CUDA 11.8, NVIDIA A800

Error traceback

No response

zxy1119 commented 3 months ago

@lvhan028 Could you please help answer my questions?

lvhan028 commented 3 months ago

It is because of the launch overhead of the w8a8 kernels. PR #2104 is working on it.
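To make the launch-overhead point concrete: compared with a single fp16 GEMM, a per-tensor-quantized w8a8 linear typically needs extra kernels to compute the activation scale, quantize, and dequantize, and each of those launches carries a fixed cost per layer that the faster int8 math has to pay back. A rough, generic PyTorch illustration of that launch pattern follows; it is not lmdeploy's actual w8a8 path, and the int8 GEMM is stood in for by a fp16 matmul, so only the extra-launch structure is representative:

```python
# Generic illustration of per-layer launch overhead, not lmdeploy's kernels.
import torch

x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

def fp16_linear(x):
    return x @ w  # one GEMM launch

def w8a8_like(x):
    # extra launches around the GEMM: scale reduction, quantize, dequantize
    scale = x.abs().amax() / 127.0
    xq = torch.clamp((x / scale).round(), -128, 127)
    y = xq @ w  # stand-in for the int8 GEMM
    return y * scale

for name, fn in (("fp16", fp16_linear), ("w8a8-like", w8a8_like)):
    for _ in range(10):  # warmup
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(1000):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    print(f"{name}: {start.elapsed_time(end) / 1000:.3f} ms per call")
```

Because those quantize/dequantize launches add a roughly constant cost to every layer, small-batch decoding can see little or no net speedup from int8 weights, which matches the near-identical token/s numbers reported above and is the overhead PR #2104 is addressing.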