HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: change VLLM_DECODE_BLOCK_BUCKET_* design to fit small AND large batch size at one warmup #328

Open ccrhx4 opened 1 week ago

ccrhx4 commented 1 week ago

Motivation.

In the current design, the user cannot set VLLM_DECODE_BLOCK_BUCKET_MIN/MAX/STEP so that both small and large batch sizes are handled properly at the same time.

For example, consider requests with input_len 512, max_output_len 1024, and batch sizes from 1 to 128.

For bs=1, the user needs to set VLLM_DECODE_BLOCK_BUCKET_* to the min/max of a single sequence:

VLLM_DECODE_BLOCK_BUCKET_MIN=1x512/128=4
VLLM_DECODE_BLOCK_BUCKET_STEP=1x128/128=1
VLLM_DECODE_BLOCK_BUCKET_MAX=1x(512+1024)/128=12

For bs=128, the user needs to set:

VLLM_DECODE_BLOCK_BUCKET_MIN=128x512/128=512
VLLM_DECODE_BLOCK_BUCKET_STEP=128x128/128=128
VLLM_DECODE_BLOCK_BUCKET_MAX=128x(1024+512)/128=1536

Right now, it is not possible to choose these environment variables so that both bs=1 and bs=128 are covered in a single warmup.
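The conflict can be seen by computing the bucket values each batch size would need. Below is a minimal sketch of the arithmetic above (not vLLM code; BLOCK_SIZE=128 and the helper name are assumptions for illustration only):

```python
# Minimal sketch of the current-style block-bucket arithmetic.
# BLOCK_SIZE=128 and block_bucket_env() are hypothetical, for illustration.
BLOCK_SIZE = 128

def block_bucket_env(bs, input_len, max_output_len):
    """Return the (MIN, STEP, MAX) bucket values a given batch size needs."""
    bucket_min = bs * input_len // BLOCK_SIZE
    bucket_step = bs * BLOCK_SIZE // BLOCK_SIZE   # one block per sequence per step
    bucket_max = bs * (input_len + max_output_len) // BLOCK_SIZE
    return bucket_min, bucket_step, bucket_max

print(block_bucket_env(1, 512, 1024))    # (4, 1, 12)
print(block_bucket_env(128, 512, 1024))  # (512, 128, 1536)
```

The two batch sizes require disjoint MIN/STEP/MAX values, which is why one set of VLLM_DECODE_BLOCK_BUCKET_* settings cannot serve both.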

Proposed Change.

I am proposing to change VLLM_DECODE_BLOCK_BUCKET_* to VLLM_DECODE_SEQ_*:

- VLLM_DECODE_SEQ_MIN: minimum sequence length
- VLLM_DECODE_SEQ_MAX: maximum sequence length
- VLLM_DECODE_SEQ_STEP: sequence length step

When warming up graphs, vLLM would compute the graph shapes as: (bs, total_block_number) = (bs, bs x (VLLM_DECODE_SEQ_MIN + VLLM_DECODE_SEQ_STEP x N) / BLOCK_SIZE).

For bs=1 and bs=128, the user can set VLLM_DECODE_SEQ_* as:

VLLM_DECODE_SEQ_MIN=512
VLLM_DECODE_SEQ_MAX=512+1024=1536
VLLM_DECODE_SEQ_STEP=128
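A sketch of the proposed per-sequence bucketing is below. The VLLM_DECODE_SEQ_* names come from this RFC and are not an existing vLLM option; the generator is illustrative only, assuming BLOCK_SIZE=128:

```python
# Illustrative sketch of the proposed per-sequence bucketing from this RFC.
BLOCK_SIZE = 128
VLLM_DECODE_SEQ_MIN = 512
VLLM_DECODE_SEQ_MAX = 1536   # 512 + 1024
VLLM_DECODE_SEQ_STEP = 128

def decode_buckets(batch_sizes):
    """Yield (bs, total_block_number) warmup shapes for every batch size."""
    for bs in batch_sizes:
        seq_len = VLLM_DECODE_SEQ_MIN
        while seq_len <= VLLM_DECODE_SEQ_MAX:
            yield bs, bs * seq_len // BLOCK_SIZE
            seq_len += VLLM_DECODE_SEQ_STEP

# One set of values now covers both bs=1 and bs=128 in a single warmup:
for shape in decode_buckets([1, 128]):
    print(shape)
```

Because the step is expressed per sequence rather than in total blocks, the same three settings scale naturally with batch size instead of requiring a different configuration per batch size.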

Feedback Period.

No response

CC List.

@kzawora-intel Please kindly provide your feedback. Thanks.

Any Other Things.

No response

Before submitting a new issue...

michalkuligowski commented 6 days ago

@ccrhx4 Hi, please refer to https://github.com/HabanaAI/vllm-fork/pull/345, which we have been working on, and check whether it provides the required configuration abilities.