This allows KV cache pre-allocation and key-length padding outside of the inference runner. With this, the inference runner is exclusively a CPU optimization (except for small GPU gains from CUDA graphs).
Separate PR for now because the inference runner needs to be adapted.
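For context, a minimal sketch of what pre-allocation and key-length padding outside the runner could look like; the helper names, shapes, and bucket size are illustrative assumptions, not the actual API:

```python
import torch

def preallocate_kv_cache(num_layers, batch_size, max_seq_len, num_heads, head_dim,
                         dtype=torch.float16, device="cuda"):
    # Hypothetical helper: allocate full-size K and V buffers up front so the
    # runner never resizes them on the hot path.
    shape = (num_layers, 2, batch_size, num_heads, max_seq_len, head_dim)
    return torch.zeros(shape, dtype=dtype, device=device)

def pad_key_length(seq_len, bucket=128):
    # Hypothetical helper: round the key length up to a fixed bucket so kernel
    # shapes stay stable (and CUDA-graph friendly) across requests.
    return ((seq_len + bucket - 1) // bucket) * bucket
```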