InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Performance analysis for different values of cache_block_seq_len #1195

Closed ispobock closed 7 months ago

ispobock commented 8 months ago

Motivation

@lvhan028 @lzhangzz @grimoire We tested the performance for different values of the cache_block_seq_len parameter and got the following results:

cache_block_seq_len RPS FTL (min) FTL (max) FTL (avg) PTL (50%) PTL (75%) PTL (95%) PTL (99%)
1 8.759 0.058 4.065 0.437 0.041 0.059 0.116 0.182
8 8.728 0.06 4.072 0.438 0.041 0.059 0.116 0.188
16 8.718 0.068 4.218 0.458 0.042 0.061 0.114 0.181
32 7.218 0.069 4.26 0.449 0.051 0.065 0.129 0.178
64 7.337 0.07 4.265 0.468 0.049 0.064 0.13 0.194
128 7.245 0.061 4.102 0.571 0.049 0.063 0.131 0.189
256 6.944 0.064 5.451 1.844 0.045 0.059 0.13 0.19
512 6.586 0.168 8.909 4.98 0.038 0.046 0.109 0.159
1024 5.838 0.108 14.542 10.414 0.029 0.033 0.074 0.124

(The FTL is the first token latency and the PTL is the per-token latency)
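To make the columns concrete, the sketch below shows one common way such per-request metrics are aggregated. It is not the profile_throughput.py implementation; the function name and the sample numbers are illustrative only.

```python
# Sketch of how per-request timings could be aggregated into the reported
# metrics; NOT the profile_throughput.py implementation, purely illustrative.
import numpy as np

def summarize(start_ts, first_token_ts, finish_ts, token_counts, wall_time):
    start = np.asarray(start_ts)
    first = np.asarray(first_token_ts)
    finish = np.asarray(finish_ts)
    tokens = np.asarray(token_counts)

    ftl = first - start                                  # first-token latency per request
    ptl = (finish - first) / np.maximum(tokens - 1, 1)   # mean per-token latency during decode
    return {
        "RPS": len(start) / wall_time,                   # completed requests per second
        "FTL (min/avg/max)": (ftl.min(), ftl.mean(), ftl.max()),
        "PTL (50/75/95/99%)": np.percentile(ptl, [50, 75, 95, 99]).tolist(),
    }

# Tiny usage example with made-up timings in seconds:
print(summarize([0.0, 0.1], [0.3, 0.5], [2.3, 2.6], [40, 42], wall_time=3.0))
```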

Experiment Settings

Findings

Possible Explanation

Conclusion

The cache_block_seq_len parameter has a significant influence on system performance. We may need to discuss how to set its default value appropriately and document that this parameter can be tuned.

Related resources

No response

Additional context

Reproduction Procedure

  1. Modify the cache_block_seq_len in ${triton_model_path}/config.ini (see the sketch after this list)
  2. Run test command:
    python benchmark/profile_throughput.py /workdir/ShareGPT_V3_unfiltered_cleaned_split.json ./workspace --concurrency 128 --num-prompts 1000 
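
As a concrete illustration of step 1, here is a minimal sketch that scripts the config.ini edit with Python's configparser. The [llama] section name and the workspace path are assumptions; only the cache_block_seq_len key comes from this thread.

```python
# Sketch only: bump cache_block_seq_len in the turbomind config.ini before
# re-running the benchmark. Section name and path are assumptions.
import configparser

def set_cache_block_seq_len(config_path: str, value: int, section: str = "llama") -> None:
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    cfg[section]["cache_block_seq_len"] = str(value)
    with open(config_path, "w") as f:
        cfg.write(f)

# e.g. set_cache_block_seq_len("./workspace/triton_models/weights/config.ini", 64)
```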

Environment

    sys.platform: linux
    Python: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 2147483648
    GPU 0: NVIDIA A100-SXM4-80GB
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 11.8, V11.8.89
    GCC: gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11)
    PyTorch: 2.2.0+cu118
    PyTorch compiling details: PyTorch built with:
    - GCC 9.3
    - C++ Version: 201703
    - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
    - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
    - OpenMP 201511 (a.k.a. OpenMP 4.5)
    - LAPACK is enabled (usually provided by MKL)
    - NNPACK is enabled
    - CPU capability usage: AVX2
    - CUDA Runtime 11.8
    - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
    - CuDNN 8.7
    - Magma 2.6.1
    - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

LMDeploy: 0.2.4+a270a8d
transformers: 4.37.2
gradio: 3.50.2
fastapi: 0.109.0
pydantic: 2.6.0

zhyncs commented 8 months ago

Hi @lvhan028 @lzhangzz @grimoire

Perhaps we could consider updating the default value here: https://github.com/InternLM/lmdeploy/blob/be9c15a9f1f360f1d06941a8dd6989af464022e8/lmdeploy/turbomind/deploy/target_model/base.py#L59 Additionally, it might be beneficial to add logging and documentation to guide users on tuning this parameter for different scenarios.

zhyncs commented 8 months ago

And when cache_block_seq_len == 1, it is essentially equivalent to LightLLM's TokenAttention.

lzhangzz commented 8 months ago

When cache_block_seq_len >= 256:

  • The FTL increased a lot, which means the prefill phase is slower.
  • The PTL decreased a lot, which means the decode phase is faster.

This happens when you run out of cache blocks. FTL increases because new requests are left pending for cache blocks. PTL decreases because the actual batch size becomes smaller.

Also, benchmark results for batch size 128 with only 1000 prompts can be unstable because of a severe tail effect (when the responses for the last few requests are long).

zhyncs commented 8 months ago

When cache_block_seq_len >= 256:

  • The FTL increased a lot, which means the prefill phase is slower.
  • The PTL decreased a lot, which means the decode phase is faster.

This happens when you run out of cache blocks. FTL increases because new requests are left pending for cache blocks. PTL decreases because the actual batch size becomes smaller.

Also, benchmark results for batch size 128 with only 1000 prompts can be unstable because of a severe tail effect (when the responses for the last few requests are long).

Exactly. As cache_block_seq_len increases, the number of cache blocks decreases, which results in longer waiting times for new requests. Decoding latency also decreases, because a larger cache_block_seq_len reduces the number of memory accesses to the KV cache, since more tokens can be accessed at once. The smaller actual batch size you mentioned also has an impact. Is there anything wrong with my understanding?
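
A back-of-the-envelope sketch of that trade-off (all numbers illustrative, not measured): the last block of a sequence is on average about half full, so larger blocks leave more of the KV cache allocated but unused, fewer requests can hold blocks at once, and new requests queue longer (higher FTL) while the smaller running batch decodes faster (lower PTL).

```python
# Illustrative only: how block size affects block count and concurrency
# under a fixed KV-cache budget. Numbers are made up, not from the benchmark.
cache_budget_tokens = 200_000   # assumed total KV-cache capacity in tokens
avg_seq_len = 1_000             # assumed average prompt + response length

for block in (1, 16, 32, 256, 1024):
    total_blocks = cache_budget_tokens // block
    tokens_held_per_seq = avg_seq_len + block / 2        # used tokens + average rounding waste
    max_concurrent = int(cache_budget_tokens // tokens_held_per_seq)
    print(f"block={block:5d}  total_blocks={total_blocks:7d}  max_concurrent~{max_concurrent}")
```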

ispobock commented 8 months ago

Also, benchmark results for batch size 128 with only 1000 prompts can be unstable because of a severe tail effect (when the responses for the last few requests are long).

@lzhangzz Here are the test results for 5000 prompts. The RPS is much higher when cache_block_seq_len <= 16.

cache_block_seq_len RPS FTL (min) FTL (max) FTL (avg) PTL (50%) PTL (75%) PTL (95%) PTL (99%)
1 10.081 0.057 4.134 0.245 0.044 0.061 0.122 0.187
8 10.097 0.061 4.163 0.257 0.045 0.064 0.122 0.186
16 10.078 0.056 4.222 0.261 0.045 0.064 0.122 0.187
32 7.971 0.067 4.998 0.307 0.055 0.069 0.141 0.193
64 8.042 0.067 6.189 0.441 0.053 0.069 0.142 0.195
128 7.847 0.06 7.229 0.963 0.051 0.068 0.143 0.205
256 7.649 0.062 7.595 2.687 0.047 0.059 0.134 0.188
512 7.258 0.101 11.746 6.098 0.039 0.046 0.113 0.155
1024 6.299 0.192 17.091 11.746 0.03 0.033 0.076 0.128

zhyncs commented 8 months ago

Also, benchmark results for batch size 128 with only 1000 prompts can be unstable because of a severe tail effect (when the responses for the last few requests are long).

@lzhangzz Here are the test results for 5000 prompts. The RPS is much higher when cache_block_seq_len <= 16.

cache_block_seq_len RPS FTL (min) FTL (max) FTL (avg) PTL (50%) PTL (75%) PTL (95%) PTL (99%)
1 10.081 0.057 4.134 0.245 0.044 0.061 0.122 0.187
8 10.097 0.061 4.163 0.257 0.045 0.064 0.122 0.186
16 10.078 0.056 4.222 0.261 0.045 0.064 0.122 0.187
32 7.971 0.067 4.998 0.307 0.055 0.069 0.141 0.193
64 8.042 0.067 6.189 0.441 0.053 0.069 0.142 0.195
128 7.847 0.06 7.229 0.963 0.051 0.068 0.143 0.205
256 7.649 0.062 7.595 2.687 0.047 0.059 0.134 0.188
512 7.258 0.101 11.746 6.098 0.039 0.046 0.113 0.155
1024 6.299 0.192 17.091 11.746 0.03 0.033 0.076 0.128

Hi @lvhan028 @lzhangzz @grimoire Maybe we could change the default value of cache_block_seq_len to 16 based on the benchmark results showing significant advantages for RPS, FTL, and PTL. Do you have any suggestions? Thanks.

lvhan028 commented 8 months ago

@ispobock @zhyncs Thank you so much for such comprehensive experiments. Let me check whether the model can pass the evaluation test with a smaller cache_block_seq_len.

lvhan028 commented 8 months ago

Here are my test results:

  1. When cache_block_seq_len=16, the inference output is garbled. This can be reproduced with `lmdeploy chat turbomind <turbomind_model_path>`
  2. When cache_block_seq_len=32, the model evaluation results shown in the table below seem fine. The test method is presented in this guide.

    I agree with exposing `cache_block_seq_len` in the API. However, the default value cannot be set to 16 right now.

@lzhangzz may follow up on this.

dataset version metric mode Llama-2-7b-chat-hf
--------- 考试 Exam --------- - - - -
ceval - naive_average gen 28.44
agieval - - - -
mmlu - naive_average gen 35.36
GaokaoBench - - - -
ARC-c - - - -
--------- 语言 Language --------- - - - -
WiC d06864 accuracy gen 0.00
summedits - - - -
chid-dev - - - -
afqmc-dev - - - -
bustm-dev - - - -
cluewsc-dev - - - -
WSC 7902a7 accuracy gen 0.00
winogrande - - - -
flores_100 - - - -
--------- 知识 Knowledge --------- - - - -
BoolQ - - - -
commonsense_qa - - - -
nq - - - -
triviaqa 2121ce score gen 56.13
--------- 推理 Reasoning --------- - - - -
cmnli - - - -
ocnli - - - -
ocnli_fc-dev - - - -
AX_b - - - -
AX_g - - - -
CB - - - -
RTE - - - -
story_cloze - - - -
COPA - - - -
ReCoRD - - - -
hellaswag - - - -
piqa - - - -
siqa - - - -
strategyqa - - - -
math - - - -
gsm8k 1d7fe4 accuracy gen 28.28
TheoremQA - - - -
openai_humaneval - - - -
mbpp - - - -
bbh - - - -
--------- 理解 Understanding --------- - - - -
C3 - - - -
CMRC_dev - - - -
DRCD_dev - - - -
MultiRC - - - -
race-middle 9a54b6 accuracy gen 41.78
race-high 9a54b6 accuracy gen 39.28
openbookqa_fact - - - -
csl_dev - - - -
lcsts - - - -
Xsum - - - -
eprstmt-dev - - - -
lambada - - - -
tnews-dev - - - -

ispobock commented 8 months ago

@lvhan028 It seems the results are only reasonable when cache_block_seq_len is a multiple of 32. For other values, the output is almost always garbled. We need to figure out why this happens.

lvhan028 commented 8 months ago

">=sm80, tile size 32" "sm75,70, tile size 64" image

ispobock commented 8 months ago

@lvhan028 Got it. I just tried changing the tile size to 16, and the result for cache_block_seq_len = 16 is fine. I also tested the throughput and found that the RPS dropped from 10.078 to 7.734.

cache_block_seq_len RPS FTL (min) FTL (max) FTL (avg) PTL (50%) PTL (75%) PTL (95%) PTL (99%)
16 7.734 0.062 4.958 0.276 0.057 0.072 0.143 0.195

So using a smaller cache_block_seq_len will not achieve a performance gain.

lvhan028 commented 7 months ago

Merged in PR #1218
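
For readers arriving later: once the parameter is exposed in the engine configuration (the subject of the PR above), tuning it from Python might look roughly like the sketch below. The exact placement of the field in TurbomindEngineConfig reflects recent lmdeploy releases and is an assumption, not something confirmed in this thread.

```python
# Sketch, assuming a recent lmdeploy where TurbomindEngineConfig exposes
# cache_block_seq_len; keep it a multiple of the attention tile size (32 on sm80+).
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(cache_block_seq_len=64)
pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)  # model name illustrative
print(pipe(["Hello, how are you?"]))
```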