The default batch_scheduler_policy, max_utilization, does not work with enc_dec models that use in-flight batching with streaming; the guaranteed_no_evict policy does. This change updates the policy for CI, but we also need to document this in the release notes.
We still don't know why this difference affects only enc_dec models. We should investigate further, and also explore how the policy choice affects performance for other model types. max_utilization is the default because it maximizes throughput, but guaranteed_no_evict appears to be better for latency.
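As a rough illustration of the rule described above, here is a minimal sketch of how a CI setup could choose the policy. The function name, its parameters, and the constants are hypothetical and do not come from the actual codebase; only the two policy names and the enc_dec condition are taken from this description.

```python
# Hypothetical helper for illustration only; not part of the real codebase.
DEFAULT_POLICY = "max_utilization"       # current default, favors throughput
FALLBACK_POLICY = "guaranteed_no_evict"  # works for enc_dec with streaming


def pick_batch_scheduler_policy(
    model_type: str, inflight_batching: bool, streaming: bool
) -> str:
    """Return the batch_scheduler_policy value to use for a CI run."""
    # The failing combination reported here: enc_dec models that use
    # in-flight batching together with streaming.
    if model_type == "enc_dec" and inflight_batching and streaming:
        return FALLBACK_POLICY
    # Every other combination keeps the default.
    return DEFAULT_POLICY
```

This keeps max_utilization for every other configuration, so throughput-oriented setups are unaffected until the root cause is understood.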
…5 working with in flight batching