littletomatodonkey opened 1 month ago
Multi-block mode (MBM) works only with long input and long output; if max_new_tokens is 1, MBM does not work.
That is expected, because we only use multi-block in the generation phase (generating new tokens). In the context phase we have enough blocks to run in parallel, so we don't need multi-block.
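As a toy illustration of that point (the head, batch, and token counts below are illustrative assumptions, not TRT-LLM internals):

```python
# Toy sketch of available parallelism per attention layer in each phase.
# All numbers are illustrative assumptions, not TRT-LLM internals.
num_heads = 16
batch_size = 1
context_len = 4096
num_sms = 132  # e.g. H100-SXM

# Context phase: attention runs over every input token, so there is one
# unit of work per (head, token) pair -- far more than the SM count.
context_work_units = batch_size * num_heads * context_len

# Generation phase: only one new query token per step, so without
# multi-block there are just batch_size * num_heads thread blocks,
# leaving most SMs idle.
generation_work_units = batch_size * num_heads * 1

print(context_work_units, generation_work_units)  # 65536 16
```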
MBM does not work if the input token length is <= 4096, which is different from flash decoding?
I'll leave it to others to reply.
> MBM does not work if the input token length is <= 4096, which is different from flash decoding?

The idea is that we will first try to fully utilize one SM before using more blocks per sequence, so there is a threshold that determines whether multi-block mode needs to be enabled or not. You can always fine-tune the performance by setting TRTLLM_ENABLE_MMHA_MULTI_BLOCK_DEBUG=1 TRTLLM_MMHA_BLOCKS_PER_SEQUENCE=4; TRTLLM_MMHA_BLOCKS_PER_SEQUENCE is the variable to tune. The workflow would be:

- building engines: set the maximum number of blocks per sequence (32) with TRTLLM_ENABLE_MMHA_MULTI_BLOCK_DEBUG=1 TRTLLM_MMHA_BLOCKS_PER_SEQUENCE=32.
- inference: fine-tune the number of blocks per sequence (e.g. 4) with TRTLLM_ENABLE_MMHA_MULTI_BLOCK_DEBUG=1 TRTLLM_MMHA_BLOCKS_PER_SEQUENCE=4; it can be any integer <= 32.
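The two-step workflow above might be scripted like this. This is a sketch: only the environment variables come from this thread, and the build/benchmark commands are placeholders for whatever you actually run.

```python
import os
import subprocess

def run_with_blocks(cmd, blocks_per_seq):
    """Run cmd with the multi-block env vars from this thread set.

    blocks_per_seq can be any integer <= 32 (the build-time maximum).
    """
    env = dict(
        os.environ,
        TRTLLM_ENABLE_MMHA_MULTI_BLOCK_DEBUG="1",
        TRTLLM_MMHA_BLOCKS_PER_SEQUENCE=str(blocks_per_seq),
    )
    subprocess.run(cmd, env=env, check=True)

# Build phase: allow the maximum number of blocks per sequence (32).
# run_with_blocks(["<your engine build command>"], 32)  # placeholder

# Inference phase: sweep candidate values to find the best latency.
for n in (1, 4, 8, 16, 32):
    pass  # run_with_blocks(["<your benchmark command>"], n)  # placeholder
```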
Thanks for your reply, I'll try it now and give feedback.
Hi @PerkzZheng,
I tested the model with the env vars, but it seems that the inference cost is not consistent with TRTLLM_MMHA_BLOCKS_PER_SEQUENCE. Does that make sense?

- TRTLLM_MMHA_BLOCKS_PER_SEQUENCE=32: latency avg 5.79s
- TRTLLM_MMHA_BLOCKS_PER_SEQUENCE=16: latency avg 4.97s
- TRTLLM_MMHA_BLOCKS_PER_SEQUENCE=8: latency avg 4.73s
- TRTLLM_MMHA_BLOCKS_PER_SEQUENCE=4: latency avg 4.77s
- TRTLLM_MMHA_BLOCKS_PER_SEQUENCE=1: latency avg 5.07s
- unset TRTLLM_MMHA_BLOCKS_PER_SEQUENCE: latency avg 4.77s
- unset TRTLLM_MMHA_BLOCKS_PER_SEQUENCE && unset TRTLLM_ENABLE_MMHA_MULTI_BLOCK_DEBUG: latency avg 4.78s
@littletomatodonkey splitting one sequence into more blocks doesn't mean you will get more speedups. More blocks incur more reduction overhead, and more waves if you have already fully utilized the SMs.
Then how can I know whether I have fully utilized the SMs, and what is the best practice for multi_block_mode in TensorRT-LLM? Thanks!
Take H100-SXM as an example: you have 132 SMs. Say the batch size is 1 and the number of heads is 16; then normally we can split the sequence into 132/16 = 8 blocks to fully utilize all SMs. But if the sequence length is quite small, like 1K, it might not be worth 8 blocks per sequence (maybe fewer).
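That rule of thumb can be written down as a quick sanity check. This is a simplified sketch of the arithmetic above only, not the actual TRT-LLM heuristic, which also weighs sequence length and other factors.

```python
def rough_blocks_per_sequence(num_sms, batch_size, num_heads):
    """Rough blocks-per-sequence estimate: split until the SMs are covered.

    Simplified from the comment above; the real kernel heuristic also
    considers sequence length, so treat this as a starting point only.
    """
    # CTAs already running concurrently without multi-block mode.
    concurrent_ctas = batch_size * num_heads
    # Split each sequence until the SM count is roughly covered.
    return max(1, num_sms // concurrent_ctas)

# H100-SXM: 132 SMs, batch 1, 16 heads -> 132 // 16 = 8 blocks/sequence.
print(rough_blocks_per_sequence(132, 1, 16))  # 8
```

With larger batches the SMs fill up on their own: at batch size 8 and 16 heads there are already 128 concurrent CTAs, so the estimate drops to 1 block per sequence.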
Please reopen this ticket if there's further discussion.
Hi, I tested multi-block mode for TRT-LLM with the Yi-6B model (llama structure); the performance is as follows. Could you please tell me whether this conclusion meets the expectations for TRT-LLM multi-block mode? Thanks!
wo mbm means without multi-block mode and with mbm means with multi-block mode. The convert script is as follows.