jjjjohnson opened 2 months ago
I also find that the performance of the PyTorch backend is about 50% of the TurboMind backend in https://github.com/InternLM/lmdeploy/issues/1370
The performance of LMDeploy (TurboMind backend) and vLLM is comparable; in fact, LMDeploy is even better.
concurrency: 4
is too small for a benchmark. It's better to use a larger concurrency.
Qwen has not been fully optimized; we have not applied a custom kernel to rotary embedding. This PR replaces apply_rotary_pos_emb
with our custom kernel. Please give it a try.
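For context, the unfused reference path that such a PR would replace looks roughly like the sketch below. This is the standard HuggingFace-style apply_rotary_pos_emb, not LMDeploy's exact code, and the tensor shapes are illustrative; the point is that each call launches several separate elementwise kernels, which a fused custom kernel collapses into one launch:

```python
import torch

def rotate_half(x):
    # Split the last dimension in two and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    # Reference (unfused) rotary embedding: four multiplies, two adds,
    # and the chunk/cat above each become their own kernel launch.
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

At small batch sizes this launch overhead is a noticeable fraction of decode time, which is part of why the unoptimized path hurts low-concurrency throughput.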
```shell
# server
lmdeploy serve api_server \
    /path/to/Qwen-14B-Chat \
    --server-port 23333 \
    --backend pytorch \
    --cache-max-entry-count 0.95 \
    --max-batch-size 256

# client
python3 \
    lmdeploy/benchmark/profile_restful_api.py \
    http://0.0.0.0:23333 \
    /path/to/Qwen-14B-Chat \
    ShareGPT_V3_unfiltered_cleaned_split.json \
    --num_prompts 3000 \
    --concurrency 256
```
performance
concurrency: 256
elapsed_time: 491.255s
number of prompt tokens: 680073
number of completion tokens: 620970
token throughput (completion token): 1264.047 token/s
token throughput (prompt + completion token): 2648.405 token/s
RPS (request per second): 6.107 req/s
RPM (request per minute): 366.408 req/min
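As a sanity check, the headline metrics follow directly from the raw counts in the log. A minimal sketch, with the values copied from the run above:

```python
# Raw counts reported by profile_restful_api.py for this run
elapsed_s = 491.255
num_prompts = 3000
prompt_tokens = 680073
completion_tokens = 620970

# Derived metrics
completion_tps = completion_tokens / elapsed_s                 # ~1264.05 token/s
total_tps = (prompt_tokens + completion_tokens) / elapsed_s    # ~2648.41 token/s
rps = num_prompts / elapsed_s                                  # ~6.107 req/s
rpm = rps * 60                                                 # ~366.41 req/min
```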
Thanks @grimoire. My usage is low-concurrency, so it is important to see whether it is fast enough under concurrency: 4.
Checklist
Describe the bug
I tried to benchmark the PyTorch Engine performance and found it very poor.
PyTorch Engine
concurrency: 4, input token throughput: 101.53 tokens/s, output token throughput: 93.32 tokens/s, total token throughput: 194.85 tokens/s

vllm
concurrency: 4, input token throughput: 184.18 tokens/s, output token throughput: 169.28 tokens/s, total token throughput: 353.46 tokens/s

Is this normal? Am I missing something when using the PyTorch Engine?
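For reference, the gap can be quantified directly from the total-throughput figures reported above (numbers copied from this comparison):

```python
# Total token throughput at concurrency 4, from the two runs above
pytorch_total = 194.85  # tokens/s, PyTorch Engine
vllm_total = 353.46     # tokens/s, vllm

ratio = pytorch_total / vllm_total
print(f"PyTorch Engine reaches {ratio:.0%} of vLLM throughput")  # ~55%
```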
Reproduction
- model: Qwen14B
- GPU: A100
- LMDeploy: 0.3.0
- Dataset: ShareGPT_V3_unfiltered_cleaned_split.json
- Script: profile_restful_api.py
Environment
Error traceback
No response