HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[bucketing overhaul 1/n] Add padding-aware scheduling and option to limit prefill batch size #394

Closed by kzawora-intel 1 week ago

kzawora-intel commented 1 week ago

This PR adds the following functionality, which can be enabled via engine flags: