Closed by ispobock 7 months ago
Hi @lvhan028 @lzhangzz @grimoire
Perhaps we could consider updating the default value here: https://github.com/InternLM/lmdeploy/blob/be9c15a9f1f360f1d06941a8dd6989af464022e8/lmdeploy/turbomind/deploy/target_model/base.py#L59 Additionally, it might be beneficial to add logging and documentation to guide users on adjusting this parameter for various scenarios.
And when `cache_block_seq_len == 1`, it is essentially equivalent to LightLLM's TokenAttention.
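For anyone who wants to experiment before any default changes, here is a minimal sketch of adjusting the parameter in a converted TurboMind workspace; the config path and the `[llama]` section name are assumptions based on the reproduction steps at the end of this thread and may differ between versions.

```python
# Hedged sketch: adjust cache_block_seq_len in a converted TurboMind workspace.
# Both the path and the [llama] section name are assumptions; check your own
# ${triton_model_path}/config.ini before editing.
import configparser

cfg_path = "workspace/triton_models/weights/config.ini"  # hypothetical path
cfg = configparser.ConfigParser()
if not cfg.read(cfg_path):
    raise SystemExit(f"config not found: {cfg_path}")
cfg["llama"]["cache_block_seq_len"] = "64"  # e.g. 32 / 64 / 128 depending on workload
with open(cfg_path, "w") as f:
    cfg.write(f)
```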
> When `cache_block_seq_len` >= 256:
> - The FTL increased a lot, which means the prefill phase is slower.
> - The PTL decreased a lot, which means the decode phase is faster.
This happens when you run out of cache blocks. FTL increases because new requests are left pending for cache blocks. PTL decreases because the actual batch size becomes smaller.
Also, benchmark results for batch size 128 with only 1000 prompts can be unstable because of a severe tail effect (when the responses for the last few requests are long).
Exactly. As `cache_block_seq_len` increases, the actual number of blocks decreases, which results in a longer waiting time for new requests.
And the decoding latency decreases, because increasing `cache_block_seq_len` reduces the number of memory accesses to the KV cache, since more tokens can be accessed at once. The actual batch size becoming smaller, as you mentioned, also has an impact.
Is there anything wrong with my understanding?
> Also, benchmark results for batch size 128 with only 1000 prompts can be unstable because of a severe tail effect (when the responses for the last few requests are long).
@lzhangzz Here are the test results for 5000 prompts. The RPS is much higher when `cache_block_seq_len` <= 16.
cache_block_seq_len | RPS | FTL (min) | FTL (max) | FTL (avg) | PTL (50%) | PTL (75%) | PTL (95%) | PTL (99%) |
---|---|---|---|---|---|---|---|---|
1 | 10.081 | 0.057 | 4.134 | 0.245 | 0.044 | 0.061 | 0.122 | 0.187 |
8 | 10.097 | 0.061 | 4.163 | 0.257 | 0.045 | 0.064 | 0.122 | 0.186 |
16 | 10.078 | 0.056 | 4.222 | 0.261 | 0.045 | 0.064 | 0.122 | 0.187 |
32 | 7.971 | 0.067 | 4.998 | 0.307 | 0.055 | 0.069 | 0.141 | 0.193 |
64 | 8.042 | 0.067 | 6.189 | 0.441 | 0.053 | 0.069 | 0.142 | 0.195 |
128 | 7.847 | 0.06 | 7.229 | 0.963 | 0.051 | 0.068 | 0.143 | 0.205 |
256 | 7.649 | 0.062 | 7.595 | 2.687 | 0.047 | 0.059 | 0.134 | 0.188 |
512 | 7.258 | 0.101 | 11.746 | 6.098 | 0.039 | 0.046 | 0.113 | 0.155 |
1024 | 6.299 | 0.192 | 17.091 | 11.746 | 0.03 | 0.033 | 0.076 | 0.128 |
Hi @lvhan028 @lzhangzz @grimoire
Maybe we could change the default value of `cache_block_seq_len` to 16, based on the benchmark results showing significant advantages in RPS, FTL, and PTL. Do you have any suggestions? Thanks.
@ispobock @zhyncs Thank you so much for such comprehensive experiments.
Let me check whether the evaluation test passes for a smaller `cache_block_seq_len`.
Here are my test results:
- With `cache_block_seq_len=16`, the inference results are chaotic. This can be reproduced by `lmdeploy chat turbomind <turbomind_model_path>`.
- With `cache_block_seq_len=32`, the model evaluation results shown in the table below seem fine. The test method is presented in this guide.

I agree with exposing `cache_block_seq_len` in the API. However, the default value cannot be set to 16 right now.
@lzhangzz may follow it up.
dataset | version | metric | mode | Llama-2-7b-chat-hf |
---|---|---|---|---|
--------- Exam --------- | - | - | - | - |
ceval | - | naive_average | gen | 28.44 |
agieval | - | - | - | - |
mmlu | - | naive_average | gen | 35.36 |
GaokaoBench | - | - | - | - |
ARC-c | - | - | - | - |
--------- Language --------- | - | - | - | - |
WiC | d06864 | accuracy | gen | 0.00 |
summedits | - | - | - | - |
chid-dev | - | - | - | - |
afqmc-dev | - | - | - | - |
bustm-dev | - | - | - | - |
cluewsc-dev | - | - | - | - |
WSC | 7902a7 | accuracy | gen | 0.00 |
winogrande | - | - | - | - |
flores_100 | - | - | - | - |
--------- Knowledge --------- | - | - | - | - |
BoolQ | - | - | - | - |
commonsense_qa | - | - | - | - |
nq | - | - | - | - |
triviaqa | 2121ce | score | gen | 56.13 |
--------- Reasoning --------- | - | - | - | - |
cmnli | - | - | - | - |
ocnli | - | - | - | - |
ocnli_fc-dev | - | - | - | - |
AX_b | - | - | - | - |
AX_g | - | - | - | - |
CB | - | - | - | - |
RTE | - | - | - | - |
story_cloze | - | - | - | - |
COPA | - | - | - | - |
ReCoRD | - | - | - | - |
hellaswag | - | - | - | - |
piqa | - | - | - | - |
siqa | - | - | - | - |
strategyqa | - | - | - | - |
math | - | - | - | - |
gsm8k | 1d7fe4 | accuracy | gen | 28.28 |
TheoremQA | - | - | - | - |
openai_humaneval | - | - | - | - |
mbpp | - | - | - | - |
bbh | - | - | - | - |
--------- Understanding --------- | - | - | - | - |
C3 | - | - | - | - |
CMRC_dev | - | - | - | - |
DRCD_dev | - | - | - | - |
MultiRC | - | - | - | - |
race-middle | 9a54b6 | accuracy | gen | 41.78 |
race-high | 9a54b6 | accuracy | gen | 39.28 |
openbookqa_fact | - | - | - | - |
csl_dev | - | - | - | - |
lcsts | - | - | - | - |
Xsum | - | - | - | - |
eprstmt-dev | - | - | - | - |
lambada | - | - | - | - |
tnews-dev | - | - | - | - |
@lvhan028 It seems that the results are only reasonable when `cache_block_seq_len` is a multiple of 32. For other values, the outputs are almost all chaotic. We need to figure out why that happens.
">= sm80: tile size 32"; "sm75 / sm70: tile size 64"
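To make the constraint concrete, here is a small illustrative check (my reading of the comment above, not actual lmdeploy code): a cache block has to hold a whole number of kernel tiles, so `cache_block_seq_len` must be a multiple of the tile size for the target architecture.

```python
# Illustrative only, based on the tile sizes quoted above (not lmdeploy code).
def block_len_supported(cache_block_seq_len: int, sm_version: int) -> bool:
    tile_size = 32 if sm_version >= 80 else 64  # sm75 / sm70 use 64-token tiles
    return cache_block_seq_len % tile_size == 0

print(block_len_supported(16, sm_version=80))  # False -> matches the chaotic output seen above
print(block_len_supported(32, sm_version=80))  # True  -> evaluation results look fine
print(block_len_supported(32, sm_version=75))  # False -> older GPUs would need a multiple of 64
```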
@lvhan028 Got it. I just tried changing the tile size to 16, and the result for `cache_block_seq_len = 16` is fine. I also tested the throughput and found that the RPS dropped from 10.078 to 7.734.
cache_block_seq_len | RPS | FTL (min) | FTL (max) | FTL (avg) | PTL (50%) | PTL (75%) | PTL (95%) | PTL (99%) |
---|---|---|---|---|---|---|---|---|
16 | 7.734 | 0.062 | 4.958 | 0.276 | 0.057 | 0.072 | 0.143 | 0.195 |
So using a smaller `cache_block_seq_len` will not achieve a performance gain.
Merged in PR #1218
Motivation
@lvhan028 @lzhangzz @grimoire We tested the performance for different values of the `cache_block_seq_len` parameter and got the following results. (FTL is the first-token latency and PTL is the per-token latency.)
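As a reference for how these metrics can be derived, here is a minimal sketch (my own illustration, not the benchmark script used here) that computes RPS, FTL, and PTL percentiles from per-request timing records.

```python
# Minimal illustration of the reported metrics (not the benchmark script used
# here), assuming each request records its submission time, first-token time,
# finish time, and number of generated tokens, all in seconds.
from dataclasses import dataclass
import statistics

@dataclass
class RequestTiming:
    start: float        # request submission time
    first_token: float  # arrival time of the first output token
    end: float          # arrival time of the last output token
    num_tokens: int     # number of generated tokens

def summarize(requests: list[RequestTiming]) -> dict:
    wall_time = max(r.end for r in requests) - min(r.start for r in requests)
    ftl = [r.first_token - r.start for r in requests]
    # per-token latency of the decode phase (first token excluded)
    ptl = [(r.end - r.first_token) / max(r.num_tokens - 1, 1) for r in requests]
    pct = statistics.quantiles(ptl, n=100, method="inclusive")
    return {
        "RPS": len(requests) / wall_time,
        "FTL (avg)": statistics.mean(ftl),
        "PTL (50%)": pct[49],
        "PTL (99%)": pct[98],
    }

# Toy example with three fake requests; a real run aggregates thousands.
print(summarize([
    RequestTiming(0.0, 0.25, 5.0, 100),
    RequestTiming(0.1, 0.40, 6.1, 120),
    RequestTiming(0.2, 0.30, 4.8, 90),
]))
```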
Experiment Settings
- Backend: turbomind
- Model: llama2_13b_chat
- Dataset: ShareGPT_V3
- Batch size: 128
- Number of prompts: 1000
Findings
- … `cache_block_seq_len`.
- … `cache_block_seq_len` = 1.
- When `cache_block_seq_len` >= 256:
  - The FTL increased a lot, which means the prefill phase is slower.
  - The PTL decreased a lot, which means the decode phase is faster.
Possible Explanation
The prefill phase is computationally intensive and the decode phase is memory-access intensive. A larger block size reduces the number of memory accesses, but it also reduces GPU memory utilization, which leads to a smaller batch size and less parallelism.
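To illustrate this tradeoff with made-up numbers (the cache budget and sequence length below are assumptions, not taken from the benchmark):

```python
# For a fixed KV-cache budget, a larger block serves more tokens per memory
# access, but the unused slots in each sequence's last block grow, so fewer
# sequences fit in the cache and the effective batch size shrinks.
CACHE_BUDGET_TOKENS = 65_536   # hypothetical total KV-cache capacity (tokens)
AVG_SEQ_LEN = 520              # hypothetical average sequence length (tokens)

for block_len in (1, 16, 64, 256, 1024):
    total_blocks = CACHE_BUDGET_TOKENS // block_len
    blocks_per_seq = -(-AVG_SEQ_LEN // block_len)       # ceil division
    wasted_per_seq = blocks_per_seq * block_len - AVG_SEQ_LEN
    max_concurrent_seqs = total_blocks // blocks_per_seq
    print(f"block_len={block_len:5d}  blocks={total_blocks:6d}  "
          f"wasted tokens/seq={wasted_per_seq:4d}  "
          f"max concurrent seqs={max_concurrent_seqs}")
```

With these made-up numbers, the number of sequences that fit drops from 126 at `block_len = 1` to 64 at `block_len = 1024`, which matches the smaller effective batch size described above.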
Conclusion
The `cache_block_seq_len` parameter has a significant influence on system performance. Maybe we need to discuss how to set its default value appropriately and tell users that this parameter can be tuned.
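For reference, a minimal sketch of what tuning the parameter through the Python API can look like, assuming a release where it is exposed in the engine config (the thread above notes this was merged in PR #1218); treat the exact field and model names as assumptions for your version.

```python
# Sketch under the assumption that cache_block_seq_len is exposed on
# TurbomindEngineConfig (as in releases after PR #1218); verify against the
# lmdeploy version you actually run.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(cache_block_seq_len=64)  # keep it a multiple of the GPU's tile size
pipe = pipeline("internlm/internlm2-chat-7b",  # any supported model path
                backend_config=engine_cfg)
print(pipe(["Hello, how are you?"]))
```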
Related resources
No response
Additional context
Reproduction Procedure
Modify `cache_block_seq_len` in `${triton_model_path}/config.ini`.
Environment
LMDeploy: 0.2.4+a270a8d
transformers: 4.37.2
gradio: 3.50.2
fastapi: 0.109.0
pydantic: 2.6.0