InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] PyTorch Engine poor performance compared to vllm #1449

Open jjjjohnson opened 2 months ago

jjjjohnson commented 2 months ago

Describe the bug

I tried to benchmark the PyTorch Engine performance and found it to be very poor:

PyTorch Engine (concurrency: 4)
  input token throughput:  101.53 tokens/s
  output token throughput:  93.32 tokens/s
  total token throughput:  194.85 tokens/s

vLLM (concurrency: 4)
  input token throughput:  184.18 tokens/s
  output token throughput: 169.28 tokens/s
  total token throughput:  353.46 tokens/s

Is this normal? Am I missing something when using the PyTorch Engine?

Reproduction

Model: Qwen-14B
GPU: A100
LMDeploy: 0.3.0
Dataset: ShareGPT_V3_unfiltered_cleaned_split.json
Script: profile_restful_api.py

Environment

LMDeploy: 0.3.0

Error traceback

No response

wanzhenchn commented 2 months ago

I also found that the performance of the PyTorch backend is about 50% of the TurboMind backend; see https://github.com/InternLM/lmdeploy/issues/1370

The performance of LMDeploy (TurboMind backend) and vLLM is comparable; in fact, LMDeploy is even better.

grimoire commented 2 months ago

A concurrency of 4 is too small for a benchmark; it's better to use a larger concurrency. Also, Qwen has not been fully optimized yet: we have not applied a custom kernel to the rotary embedding. This PR replaces apply_rotary_pos_emb with our custom kernel. Please give it a try.
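
For reference, the rotary embedding step in question is typically written in eager PyTorch along the lines below (a minimal sketch of the common formulation, not LMDeploy's actual code). Each elementwise op launches its own CUDA kernel over the full tensor, which is the overhead a fused custom kernel removes.

import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the last dimension in half and swap the halves with a sign flip.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # Rotate query/key vectors by position-dependent angles. In eager mode
    # every multiply and add below is a separate kernel launch, so a single
    # fused kernel saves both launch overhead and memory traffic.
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed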

# server
lmdeploy serve api_server \
    /path/to/Qwen-14B-Chat \
    --server-port 23333 \
    --backend pytorch \
    --cache-max-entry-count 0.95 \
    --max-batch-size 256
# client
python3 \
    lmdeploy/benchmark/profile_restful_api.py \
    http://0.0.0.0:23333 \
    /path/to/Qwen-14B-Chat \
    ShareGPT_V3_unfiltered_cleaned_split.json \
    --num_prompts 3000 \
    --concurrency 256

performance

concurrency: 256
elapsed_time: 491.255s

number of prompt tokens: 680073
number of completion tokens: 620970
token throughput (completion token): 1264.047 token/s
token throughput (prompt + completion token): 2648.405 token/s
RPS (request per second): 6.107 req/s
RPM (request per minute): 366.408 req/min
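
As a sanity check, these throughput figures follow directly from the raw token counts and the elapsed time:

# Derive the reported metrics from the numbers above.
elapsed = 491.255                 # seconds
prompt_tokens = 680073
completion_tokens = 620970
num_prompts = 3000

print(completion_tokens / elapsed)                    # ~1264.047 token/s
print((prompt_tokens + completion_tokens) / elapsed)  # ~2648.405 token/s
print(num_prompts / elapsed)                          # ~6.107 req/s
print(num_prompts / elapsed * 60)                     # ~366.408 req/min
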
jjjjohnson commented 2 months ago

Thanks @grimoire. My usage is low concurrency, so it is important to see whether it is fast enough under concurrency: 4.
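
At low concurrency, per-request latency is usually more telling than aggregate throughput. A quick single-request timing against the server started above might look like this (a sketch assuming the api_server exposes an OpenAI-compatible /v1/chat/completions endpoint and the `requests` package is installed; the model name is a placeholder, query /v1/models for the real one):

import time
import requests

URL = "http://0.0.0.0:23333/v1/chat/completions"  # api_server from above

payload = {
    "model": "qwen-14b-chat",  # placeholder; check GET /v1/models
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 128,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=300)
elapsed = time.perf_counter() - start

# "usage" is part of the OpenAI-style response schema.
completion = resp.json().get("usage", {}).get("completion_tokens", 0)
print(f"latency: {elapsed:.2f}s, "
      f"single-stream decode rate: {completion / elapsed:.1f} token/s")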