InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] No inference performance improvement after smooth-quantizing llama2-7b #2331

Open. zxy1119 opened this issue 3 months ago.

zxy1119 commented 3 months ago


Describe the bug

I quantized the model with smooth quant; why hasn't the inference speed increased?

lmdeploy lite smooth_quant /model/llama2-7b-hf/ --work-dir /model/lmdeploy/llama2-7b-w8/

After smooth quantization, the model files shrank from 12.56 GB to 6.55 GB, but GPU memory usage during inference dropped by only about 1 GB, and the inference speed barely improved:

python profile_throughput.py /dataset/ShareGPT_V3_unfiltered_cleaned_split.json /model/llama2-7b-hf/ --backend pytorch
result: 2850.681 token/s

python profile_throughput.py /dataset/ShareGPT_V3_unfiltered_cleaned_split.json /model/lmdeploy/llama2-7b-w8/ --backend pytorch
result: 2896.486 token/s
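(For context on the sizes reported above: 12.56 GB is consistent with llama2-7b's ~6.7B parameters at 2 bytes each in fp16, and int8 weights halve that to roughly 6.3 GB, close to the observed 6.55 GB. Runtime memory falls by much less because activations and the KV cache remain fp16.)

A minimal sketch of the same comparison via lmdeploy's Python API. The `pipeline` and `PytorchEngineConfig` entry points are from the lmdeploy docs; the `generate_token_len` response field, the prompt set, and running both models in one process are assumptions here, and in practice each model would be benchmarked in a fresh process:

```python
# Hedged sketch: compare PyTorch-backend throughput of the fp16 model
# and the smooth-quantized w8a8 model. Paths are the ones reported above.
import time

from lmdeploy import pipeline, PytorchEngineConfig

prompts = ["Summarize the plot of Hamlet."] * 64  # assumed toy workload

for model_path in ("/model/llama2-7b-hf/", "/model/lmdeploy/llama2-7b-w8/"):
    pipe = pipeline(model_path, backend_config=PytorchEngineConfig())
    t0 = time.perf_counter()
    responses = pipe(prompts)
    elapsed = time.perf_counter() - t0
    # generate_token_len: per-response output length (assumed field name)
    out_tokens = sum(r.generate_token_len for r in responses)
    print(f"{model_path}: {out_tokens / elapsed:.1f} token/s")
    del pipe  # free GPU memory before the next model (a fresh process is safer)
```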

Reproduction

none

Environment

CUDA 11.8, NVIDIA A800

Error traceback

No response

zxy1119 commented 3 months ago

@lvhan028 Could you please help answer my questions?

lvhan028 commented 3 months ago

It is because of the launch overhead of the w8a8 kernels. PR #2104 is working on it.
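To make the launch-overhead point concrete: compared with a single fp16 GEMM, a per-tensor-quantized w8a8 linear typically needs extra kernels to compute the activation scale, quantize, and dequantize, and each of those launches carries a fixed cost per layer that the faster int8 math has to pay back. A rough, generic PyTorch illustration of that launch pattern follows; it is not lmdeploy's actual w8a8 path, and the int8 GEMM is stood in for by a fp16 matmul, so only the extra-launch structure is representative:

```python
# Generic illustration of per-layer launch overhead, not lmdeploy's kernels.
import torch

x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

def fp16_linear(x):
    return x @ w  # one GEMM launch

def w8a8_like(x):
    # extra launches around the GEMM: scale reduction, quantize, dequantize
    scale = x.abs().amax() / 127.0
    xq = torch.clamp((x / scale).round(), -128, 127)
    y = xq @ w  # stand-in for the int8 GEMM
    return y * scale

for name, fn in (("fp16", fp16_linear), ("w8a8-like", w8a8_like)):
    for _ in range(10):  # warmup
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(1000):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    print(f"{name}: {start.elapsed_time(end) / 1000:.3f} ms per call")
```

Because those quantize/dequantize launches add a roughly constant cost to every layer, small-batch decoding can see little or no net speedup from int8 weights, which matches the near-identical token/s numbers reported above and is the overhead PR #2104 is addressing.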