Greatly improve KV cache size in low-memory environments

PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine

https://pygmalion.chat

GNU Affero General Public License v3.0


Greatly improve KV cache size in low-memory environments #335

Status: Closed (closed by 50h100a 2 months ago)

50h100a commented 2 months ago

When calculating kv cache size, include the blocks used during profiling.

Previously these were hidden inside the peak_memory calculation, which caused a significant divergence between the calculated kv cache size and the actual memory available.
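The arithmetic described above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the function name `kv_cache_blocks` and the `profiling_blocks` parameter are hypothetical, and the baseline formula (usable memory minus peak memory, divided by the per-block size) is assumed rather than taken from the repository.

```python
def kv_cache_blocks(total_gpu_memory, gpu_memory_utilization, peak_memory,
                    cache_block_size, profiling_blocks=0):
    """Estimate how many KV cache blocks fit in GPU memory.

    Hypothetical sketch. All sizes are in bytes. `profiling_blocks` is the
    number of cache blocks that were live during the profiling run and are
    therefore already counted inside `peak_memory` (an assumption made for
    illustration only).
    """
    # Memory the engine is allowed to use at all.
    usable = int(total_gpu_memory * gpu_memory_utilization)

    # Memory held by cache blocks during profiling. Adding it back avoids
    # counting those blocks twice: once in peak_memory, once in the result.
    profiling_bytes = profiling_blocks * cache_block_size

    available = usable - peak_memory + profiling_bytes
    return max(available // cache_block_size, 0)


# Same inputs, with and without crediting the profiling blocks.
baseline = kv_cache_blocks(1000, 0.5, 300, 10)
adjusted = kv_cache_blocks(1000, 0.5, 300, 10, profiling_blocks=5)
```

With `profiling_blocks=0` the function reduces to the assumed baseline calculation, so the adjustment only changes the result when blocks were actually resident during profiling.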

Needs careful testing across lots of hardware and models.

50h100a commented 2 months ago

This is nonsense. KV cache deduplication meant my synthetic workload was a little TOO synthetic. Expanding to more representative data made the discrepancy disappear.
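A toy model of why a synthetic workload can understate cache usage is sketched below. The thread does not say how deduplication works in the engine; this sketch simply assumes blocks are shared whenever the full token prefix up to the end of the block matches, and the helper `count_blocks` is hypothetical.

```python
def count_blocks(prompts, block_size):
    """Return (logical_blocks, physical_blocks) for a list of token lists.

    Toy model only: a block is keyed by the whole token prefix up to its
    end, so two sequences share a block only if everything before it also
    matches. Real engines may deduplicate differently.
    """
    seen = set()
    logical = 0
    for tokens in prompts:
        for start in range(0, len(tokens), block_size):
            seen.add(tuple(tokens[: start + block_size]))
            logical += 1
    return logical, len(seen)


# Eight identical prompts: heavily deduplicated.
synthetic = [[0] * 64 for _ in range(8)]
# Eight prompts with no tokens in common: nothing to share.
varied = [[i * 1000 + j for j in range(64)] for i in range(8)]

synthetic_counts = count_blocks(synthetic, block_size=16)
varied_counts = count_blocks(varied, block_size=16)
```

Under this assumption the identical prompts need far fewer physical blocks than logical ones, which would make measured memory use look smaller than the calculated cache size, while the varied prompts use every block they request.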