InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] 0.6.2 vs 0.4.2 with the qwen1.5b model: inference on 0.6.2 is about 3x slower #2752

Open · xliangwu opened this issue 2 weeks ago

xliangwu commented 2 weeks ago


Describe the bug

We have been running 0.4.2 in production. We recently needed function calling, so we upgraded to the latest version and found that inference is several times slower.
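For context, here is a minimal sketch of the function-calling usage that motivated the upgrade, going through lmdeploy's OpenAI-compatible `api_server` (started with `lmdeploy serve api_server <model>`; the tool definition and prompt are illustrative placeholders, not from the original report):

```python
# Query an lmdeploy OpenAI-compatible server with a tool definition.
# Assumes the server is running locally on lmdeploy's default port 23333.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model=client.models.list().data[0].id,  # use whatever model the server exposes
    messages=[{"role": "user", "content": "What's the weather in Shanghai?"}],
    tools=tools,
)
print(resp.choices[0].message)
```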

Reproduction

0.6.2: (screenshot of inference speed)

0.4.2: (screenshot of inference speed)

Environment

3090 Ti, CentOS

Error traceback

No response

xliangwu commented 2 weeks ago

Additional information. The engine parameters logged by lmdeploy at startup:

2024-11-13 12:35:13,900 - lmdeploy - INFO - async_engine.py:168 - updated backend_config=TurbomindEngineConfig(dtype='auto', model_format='hf', tp=1, session_len=16384, max_batch_size=12, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=True, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=8192, max_prefill_iters=2)
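For anyone trying to reproduce this, the logged configuration above maps onto lmdeploy's Python API roughly as follows (values copied from the log line; the model path is a placeholder):

```python
# Rebuild the reported engine configuration via the pipeline API.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    dtype="auto",
    model_format="hf",
    tp=1,
    session_len=16384,
    max_batch_size=12,
    cache_max_entry_count=0.8,
    enable_prefix_caching=True,
    quant_policy=0,
    max_prefill_token_num=8192,
)

# Placeholder model path; the report only says "qwen1.5b".
pipe = pipeline("path/to/qwen-1.5b", backend_config=backend_config)
print(pipe(["Hello"]))
```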

Watching GPU utilization, 0.4.2 reaches 98%, but 0.6.2 only reaches about 40%.
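For reference, a generic way to sample those utilization numbers programmatically (NVML via the `nvidia-ml-py` package; a sketch, not part of the original report, which presumably used nvidia-smi):

```python
# Sample GPU utilization once per second for ten seconds while the
# server is handling requests; equivalent to watching nvidia-smi.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU util: {util.gpu}%  mem bandwidth util: {util.memory}%")
    time.sleep(1)
pynvml.nvmlShutdown()
```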

lvhan028 commented 2 weeks ago

Please provide a way to reproduce this.

Every lmdeploy release goes through model accuracy evaluation and inference speed tests, and the results have consistently met expectations.
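For a like-for-like comparison between the two versions, a rough timing sketch against the pipeline API could look like the following (the model path is a placeholder, and `generate_token_len` is assumed to be the generated-token count on lmdeploy's `Response` objects; the repo's benchmark scripts are the authoritative route):

```python
# Rough A/B throughput check: run the same prompt batch on each lmdeploy
# version (in separate environments) and compare tokens per second.
import time
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

pipe = pipeline(
    "path/to/qwen-1.5b",  # placeholder; use the model from the report
    backend_config=TurbomindEngineConfig(tp=1, max_batch_size=12),
)

prompts = ["Write a short poem about the sea."] * 12
gen_config = GenerationConfig(max_new_tokens=256)

start = time.perf_counter()
outputs = pipe(prompts, gen_config=gen_config)
elapsed = time.perf_counter() - start

# generate_token_len: assumed attribute of lmdeploy's Response dataclass.
total_tokens = sum(o.generate_token_len for o in outputs)
print(f"{total_tokens} tokens in {elapsed:.1f}s -> {total_tokens / elapsed:.1f} tok/s")
```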

lvhan028 commented 2 weeks ago

cc @zhulinJulia24

zhulinJulia24 commented 2 weeks ago

(screenshot) On a single A100 running lmdeploy 0.6.2.post1, the benchmark speed test meets expectations, with 99% GPU utilization.

xliangwu commented 2 weeks ago

Thanks. I will prepare my data and environment and then provide reproduction data.