HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[RFC]: change VLLM_DECODE_BLOCK_BUCKET_* design to fit small AND large batch size at one warmup #328

Open ccrhx4 opened 1 week ago

ccrhx4 commented 1 week ago

Motivation.

In the current design, the user cannot set VLLM_DECODE_BLOCK_BUCKET_MIN/MAX/STEP so that both small and large batch sizes are handled properly at the same time.

For example, consider requests with input_len 512, max_output_len 1024, and batch sizes from 1 to 128.

For bs=1, the user needs to set VLLM_DECODE_BLOCK_BUCKET_* to the min/max of a single sequence:

VLLM_DECODE_BLOCK_BUCKET_MIN=1x512/128=4
VLLM_DECODE_BLOCK_BUCKET_STEP=1x128/128=1
VLLM_DECODE_BLOCK_BUCKET_MAX=1x(512+1024)/128=12

For bs=128, the user needs to set:

VLLM_DECODE_BLOCK_BUCKET_MIN=128x512/128=512
VLLM_DECODE_BLOCK_BUCKET_STEP=128x128/128=128
VLLM_DECODE_BLOCK_BUCKET_MAX=128x(1024+512)/128=1536

Right now, it is not possible to choose these environment variables so that both bs=1 and bs=128 are covered in a single warmup.
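The conflict can be seen by computing the bucket values each batch size would need. Below is a minimal sketch of the arithmetic above (not vLLM code; BLOCK_SIZE=128 and the helper name are assumptions for illustration only):

```python
# Minimal sketch of the current-style block-bucket arithmetic.
# BLOCK_SIZE=128 and block_bucket_env() are hypothetical, for illustration.
BLOCK_SIZE = 128

def block_bucket_env(bs, input_len, max_output_len):
    """Return the (MIN, STEP, MAX) bucket values a given batch size needs."""
    bucket_min = bs * input_len // BLOCK_SIZE
    bucket_step = bs * BLOCK_SIZE // BLOCK_SIZE   # one block per sequence per step
    bucket_max = bs * (input_len + max_output_len) // BLOCK_SIZE
    return bucket_min, bucket_step, bucket_max

print(block_bucket_env(1, 512, 1024))    # (4, 1, 12)
print(block_bucket_env(128, 512, 1024))  # (512, 128, 1536)
```

The two batch sizes require disjoint MIN/STEP/MAX values, which is why one set of VLLM_DECODE_BLOCK_BUCKET_* settings cannot serve both.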

Proposed Change.

I am proposing to change VLLM_DECODE_BLOCK_BUCKET_* to VLLM_DECODE_SEQ_*:

- VLLM_DECODE_SEQ_MIN: minimum sequence length
- VLLM_DECODE_SEQ_MAX: maximum sequence length
- VLLM_DECODE_SEQ_STEP: sequence length step

When warming up graphs, vLLM would compute the graph shapes as: (bs, total_block_number) = (bs, bs x (VLLM_DECODE_SEQ_MIN + VLLM_DECODE_SEQ_STEP x N) / BLOCK_SIZE).

For bs=1 and bs=128, the user can set VLLM_DECODE_SEQ_* as:

VLLM_DECODE_SEQ_MIN=512
VLLM_DECODE_SEQ_MAX=512+1024=1536
VLLM_DECODE_SEQ_STEP=128
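A sketch of the proposed per-sequence bucketing is below. The VLLM_DECODE_SEQ_* names come from this RFC and are not an existing vLLM option; the generator is illustrative only, assuming BLOCK_SIZE=128:

```python
# Illustrative sketch of the proposed per-sequence bucketing from this RFC.
BLOCK_SIZE = 128
VLLM_DECODE_SEQ_MIN = 512
VLLM_DECODE_SEQ_MAX = 1536   # 512 + 1024
VLLM_DECODE_SEQ_STEP = 128

def decode_buckets(batch_sizes):
    """Yield (bs, total_block_number) warmup shapes for every batch size."""
    for bs in batch_sizes:
        seq_len = VLLM_DECODE_SEQ_MIN
        while seq_len <= VLLM_DECODE_SEQ_MAX:
            yield bs, bs * seq_len // BLOCK_SIZE
            seq_len += VLLM_DECODE_SEQ_STEP

# One set of values now covers both bs=1 and bs=128 in a single warmup:
for shape in decode_buckets([1, 128]):
    print(shape)
```

Because the step is expressed per sequence rather than in total blocks, the same three settings scale naturally with batch size instead of requiring a different configuration per batch size.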

Feedback Period.

No response

CC List.

@kzawora-intel Please kindly provide your feedback. Thanks.

Any Other Things.

No response

Before submitting a new issue...

michalkuligowski commented 6 days ago

@ccrhx4 Hi, please refer to https://github.com/HabanaAI/vllm-fork/pull/345, which we have been working on, and check whether it provides the required configuration abilities.