HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[bucketing overhaul 1/n] Add padding-aware scheduling and option to limit prefill batch size #394

Closed by kzawora-intel 1 week ago

kzawora-intel commented 1 week ago

This PR adds the following functionality, which can be enabled via engine flags: