@compilade Can you share any insights into this?
@kaetemi
Defragmenting when it fails should be good enough, and should be fast enough (I think). `llama_kv_cache_defrag` should do the right thing, but only at the next `llama_kv_cache_update` or `llama_decode`.
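A minimal sketch of that retry flow, assuming the per-sequence restore API `llama_state_seq_set_data`, which returns 0 on failure (its exact signature varies across llama.cpp versions; later ones take an explicit size). The helper name is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#include "llama.h"

// Hypothetical helper: try to restore a saved sequence; if the cache is too
// fragmented to find a contiguous slot, defrag and retry once.
static bool restore_seq_or_defrag(struct llama_context * ctx,
                                  const uint8_t * data, llama_seq_id seq_id) {
    if (llama_state_seq_set_data(ctx, data, seq_id) != 0) {
        return true; // a contiguous slot was found on the first try
    }
    llama_kv_cache_defrag(ctx); // only schedules the defrag ...
    llama_kv_cache_update(ctx); // ... it is applied here (or at llama_decode)
    return llama_state_seq_set_data(ctx, data, seq_id) != 0;
}
```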
No idea how slow (or fast) it would be, but if it's too slow, it might be possible to accelerate `llama_kv_cache_defrag_internal` with `ggml_get_rows`, or an equivalent, but without dequantization.
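For a sense of what that acceleration could look like, here is a hedged sketch (the function name and the simplified `[n_embd, n_cells]` cache layout are illustrative, not llama.cpp internals) of gathering surviving cells with `ggml_get_rows`:

```c
#include <string.h>

#include "ggml.h"

// Hedged sketch: gather the surviving cells of a K cache tensor into a
// compacted order with a single graph op. Caveat: ggml_get_rows dequantizes
// quantized rows to F32, hence the suggestion above for an equivalent that
// skips dequantization. Assumes ctx0 was created without no_alloc, so
// ids->data is host-accessible.
static struct ggml_tensor * build_compact_k(struct ggml_context * ctx0,
                                            struct ggml_tensor * k_cache,
                                            const int32_t * order, int64_t n_used) {
    struct ggml_tensor * ids = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_used);
    memcpy(ids->data, order, n_used * sizeof(int32_t));
    // Row i of the result is cell order[i] of k_cache; the caller would
    // compute this node and copy the result back over cells [0, n_used).
    return ggml_get_rows(ctx0, k_cache, ids);
}
```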
Fragmented allocation might be complicated to handle, because then the saved session KV cache would need to be copied in multiple chunks (which would also be much slower with a GPU-backed KV cache, but not slower than defragmentation). And `llama_kv_cache_find_slot` would require deeper changes to support fragmented allocation, which doesn't really fit with its current API.
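To make the chunking concern concrete, a toy sketch (all names and the flat cell layout are hypothetical, not llama.cpp internals): each contiguous free run becomes a separate copy, and with a GPU-backed cache each run would be a separate host-to-device transfer:

```c
#include <stdbool.h>
#include <stddef.h>

// Toy model of a KV cache as a flat array of cells.
typedef struct { bool used; float v; } toy_cell;

// Scatter n_cells saved values into whatever free cells exist. Returns the
// number of contiguous runs used, or 0 on failure; each run would be one
// separate memcpy (or host-to-device copy) in a real implementation.
static size_t restore_fragmented(toy_cell * cache, size_t cache_size,
                                 const float * saved, size_t n_cells) {
    size_t copied = 0, runs = 0;
    bool in_run = false;
    for (size_t i = 0; i < cache_size && copied < n_cells; i++) {
        if (!cache[i].used) {
            cache[i].used = true;
            cache[i].v    = saved[copied++];
            if (!in_run) { runs++; in_run = true; }
        } else {
            in_run = false;
        }
    }
    return copied == n_cells ? runs : 0;
}
```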
This issue was closed because it has been inactive for 14 days since being marked as stale.
Feature Description
Currently, when restoring the KV cache, the operation will simply fail if there is not enough sequential space to restore the slot. There needs to be a way to make space, or to change the allocation strategy during restore. (My current strategy is just to use one slot less every time it fails. It seems to settle around 50% slot usage, which is probably not ideal.)
Motivation
Reliability.
Possible Implementation
1) Can we just defragment when this happens? This seems like the easiest solution. Should we expose an endpoint in the server to defragment, or do it whenever the failure happens? What are the performance implications of defragmenting?
2) Or, can we allocate in a fragmented manner? And what are the performance implications in that case?