When calculating the KV cache size, include the blocks used during profiling.
Previously these were hidden inside the peak_memory calculation, which caused a significant divergence between the calculated KV cache size and the memory actually available.
Needs careful testing across a wide range of hardware and models.
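The sizing arithmetic behind this claim can be sketched as below. This is a minimal, hypothetical illustration: `num_kv_blocks`, `profiling_kv_bytes`, and the numbers in the example are assumptions for explanation, not the project's actual code. If the profiling peak already contains KV blocks allocated during the profile run, subtracting the whole peak from the budget counts that memory twice, so the proposed change adds it back.

```python
# Hypothetical sketch of KV cache sizing; names and numbers are illustrative only.
MiB = 1024**2
GiB = 1024**3


def num_kv_blocks(total_bytes, utilization, peak_bytes, block_bytes,
                  profiling_kv_bytes=0):
    """Return how many KV cache blocks fit in the memory left after profiling.

    total_bytes:        device memory (assumed input)
    utilization:        fraction of device memory the engine may use
    peak_bytes:         peak usage observed during the profiling run
    block_bytes:        size of one KV cache block
    profiling_kv_bytes: KV blocks allocated during profiling that are already
                        counted inside peak_bytes (the PR's proposed correction)
    """
    budget = int(total_bytes * utilization)
    # Memory the model itself needs, excluding KV blocks that were only
    # allocated for the profile run and will be reused by the real cache.
    non_kv_peak = peak_bytes - profiling_kv_bytes
    return max((budget - non_kv_peak) // block_bytes, 0)


if __name__ == "__main__":
    # 80 GiB device, 75% utilization, 20 GiB peak of which 2 GiB is profiling KV.
    print(num_kv_blocks(80 * GiB, 0.75, 20 * GiB, 2 * MiB))
    print(num_kv_blocks(80 * GiB, 0.75, 20 * GiB, 2 * MiB,
                        profiling_kv_bytes=2 * GiB))
```

With these assumed numbers the correction raises the block count from 20480 to 21504; whether such a gap exists in practice is exactly what the testing above would need to show.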
This is nonsense. KV cache deduplication meant my synthetic workload was a little TOO synthetic.
Expanding to more representative data made the discrepancy disappear.
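Why a synthetic workload can mislead here can be sketched with content-hashed block deduplication. This is a minimal sketch under stated assumptions: a block size of 16 tokens, chained SHA-256 hashing of full blocks, and the helper names `block_hashes` and `blocks_needed` are all hypothetical and not the engine's actual implementation. When every request uses the same prompt, identical blocks are shared, so the workload touches far fewer KV blocks than a realistic one would.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV cache block (assumed value)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block, chaining in the parent hash so that a block's
    identity depends on its entire prefix. Trailing partial blocks are ignored."""
    hashes, parent = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        chunk = tokens[start:start + block_size]
        h = hashlib.sha256(parent + repr(chunk).encode()).digest()
        hashes.append(h)
        parent = h
    return hashes


def blocks_needed(prompts, dedup=True):
    """Count KV blocks needed for a batch of prompts, with or without dedup."""
    if not dedup:
        return sum(len(block_hashes(p)) for p in prompts)
    unique = set()
    for p in prompts:
        unique.update(block_hashes(p))
    return len(unique)


if __name__ == "__main__":
    # Eight copies of one 256-token prompt: every block after the first copy is shared.
    synthetic = [list(range(256))] * 8
    # Eight distinct 256-token prompts: nothing can be shared.
    varied = [list(range(i * 1000, i * 1000 + 256)) for i in range(8)]
    print(blocks_needed(synthetic, dedup=False), blocks_needed(synthetic))
    print(blocks_needed(varied))
```

The identical-prompt batch needs 16 blocks with deduplication instead of 128, while the varied batch needs all 128, which is the kind of gap that can make a memory estimate look wrong when only the test data is unrepresentative.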