[Bug] 0.6.2 vs 0.4.2 with the qwen1.5b model: inference on 0.6.2 is about 3x slower

InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

https://lmdeploy.readthedocs.io/en/latest/

Apache License 2.0


[Bug] 0.6.2 vs 0.4.2 with the qwen1.5b model: inference on 0.6.2 is about 3x slower #2752

Open xliangwu opened 2 weeks ago

xliangwu commented 2 weeks ago

Checklist

[X] 1. I have searched related issues but cannot get the expected help.
[ ] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

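We have been running version 0.4.2 in production. We recently needed function calling, so we upgraded to the latest version and found that inference is several times slower.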

Reproduction

0.6.2

0.4.2

Environment

3090Ti, CentOS

Error traceback

No response

xliangwu commented 2 weeks ago

Additional information. The parameters lmdeploy logs after startup:
2024-11-13 12:35:13,900 - lmdeploy - INFO - async_engine.py:168 - updated backend_config=TurbomindEngineConfig(dtype='auto', model_format='hf', tp=1, session_len=16384, max_batch_size=12, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=True, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=8192, max_prefill_iters=2)
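
When comparing two versions, it can help to turn the logged config into a dict and diff it against the one the other version prints. Below is a minimal sketch using only the Python standard library. It assumes the log line has the shape shown above; `parse_engine_config` and `diff_configs` are hypothetical helper names, not part of lmdeploy, and the 0.4.2 config used here is a made-up stand-in because the thread does not show it.

```python
import ast


def parse_engine_config(log_line, class_name="TurbomindEngineConfig"):
    """Extract the keyword arguments of a logged config repr into a dict."""
    start = log_line.index(class_name)
    # Take everything from the class name up to the last closing parenthesis.
    end = log_line.rindex(")") + 1
    call = ast.parse(log_line[start:end], mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("no config call expression found in log line")
    # literal_eval only accepts literals, so nothing from the log is executed.
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


def diff_configs(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


LOG_LINE = (
    "2024-11-13 12:35:13,900 - lmdeploy - INFO - async_engine.py:168 - "
    "updated backend_config=TurbomindEngineConfig(dtype='auto', model_format='hf', "
    "tp=1, session_len=16384, max_batch_size=12, cache_max_entry_count=0.8, "
    "cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=True, "
    "quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, "
    "revision=None, max_prefill_token_num=8192, num_tokens_per_iter=8192, "
    "max_prefill_iters=2)"
)

config_062 = parse_engine_config(LOG_LINE)

# Hypothetical stand-in for the 0.4.2 config: the thread does not include it.
config_042_hypothetical = {**config_062, "enable_prefix_caching": False}

changed = diff_configs(config_042_hypothetical, config_062)
print(changed)
```

Running the same parser on the line logged by each version would show which effective settings differ between the two deployments, which narrows down whether the slowdown comes from configuration or from the engine itself.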

Watching GPU utilization, 0.4.2 reaches 98%, but 0.6.2 only gets to about 40%.
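
To make a utilization comparison like this reproducible, the readings can be sampled at a fixed interval while the same load runs against each version. The sketch below is one possible way to do that, assuming `nvidia-smi` is on the PATH; the thread does not say how utilization was actually observed, and the helper names are hypothetical.

```python
import subprocess
import time

# Standard nvidia-smi query: one utilization percentage per GPU, one per line.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(output):
    """Parse the CSV output into a list of integer percentages, one per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def sample_utilization(samples=10, interval_s=1.0, gpu_index=0):
    """Poll nvidia-smi `samples` times and return the readings for one GPU."""
    readings = []
    for _ in range(samples):
        result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
        readings.append(parse_utilization(result.stdout)[gpu_index])
        time.sleep(interval_s)
    return readings


def summarize(readings):
    """Reduce a list of readings to min, max, and mean."""
    return {
        "min": min(readings),
        "max": max(readings),
        "mean": sum(readings) / len(readings),
    }


if __name__ == "__main__":
    # Run this while the server is under the same request load on each version.
    print(summarize(sample_utilization(samples=30, interval_s=1.0)))
```

Attaching the summary for both versions, together with the request load that produced it, gives maintainers something concrete to compare.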

lvhan028 commented 2 weeks ago

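Could you please provide steps to reproduce?

Every lmdeploy release goes through model accuracy evaluation and inference speed testing, and the results have always been as expected.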

lvhan028 commented 2 weeks ago

cc @zhulinJulia24

zhulinJulia24 commented 2 weeks ago

On a single A100 with lmdeploy 0.6.2.post1, the benchmark speed is as expected, and GPU utilization is 99%, which is also as expected.

xliangwu commented 2 weeks ago