@compilade Can you share any insights into this?
@kaetemi
Defragmenting when it fails should be good enough, and should be fast enough (I think). `llama_kv_cache_defrag` should do the right thing, but only at the next `llama_kv_cache_update` or `llama_decode`.
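A minimal sketch of that retry flow, assuming the per-sequence restore API `llama_state_seq_set_data`, which returns 0 on failure (its exact signature varies across llama.cpp versions; later ones take an explicit size). The helper name is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#include "llama.h"

// Hypothetical helper: try to restore a saved sequence; if the cache is too
// fragmented to find a contiguous slot, defrag and retry once.
static bool restore_seq_or_defrag(struct llama_context * ctx,
                                  const uint8_t * data, llama_seq_id seq_id) {
    if (llama_state_seq_set_data(ctx, data, seq_id) != 0) {
        return true; // a contiguous slot was found on the first try
    }
    llama_kv_cache_defrag(ctx); // only schedules the defrag ...
    llama_kv_cache_update(ctx); // ... it is applied here (or at llama_decode)
    return llama_state_seq_set_data(ctx, data, seq_id) != 0;
}
```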
No idea how slow (or fast) it would be, but if it's too slow, it might be possible to accelerate `llama_kv_cache_defrag_internal` with `ggml_get_rows`, or an equivalent, but without dequantization.
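For a sense of what that acceleration could look like, here is a hedged sketch (the function name and the simplified `[n_embd, n_cells]` cache layout are illustrative, not llama.cpp internals) of gathering surviving cells with `ggml_get_rows`:

```c
#include <string.h>

#include "ggml.h"

// Hedged sketch: gather the surviving cells of a K cache tensor into a
// compacted order with a single graph op. Caveat: ggml_get_rows dequantizes
// quantized rows to F32, hence the suggestion above for an equivalent that
// skips dequantization. Assumes ctx0 was created without no_alloc, so
// ids->data is host-accessible.
static struct ggml_tensor * build_compact_k(struct ggml_context * ctx0,
                                            struct ggml_tensor * k_cache,
                                            const int32_t * order, int64_t n_used) {
    struct ggml_tensor * ids = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_used);
    memcpy(ids->data, order, n_used * sizeof(int32_t));
    // Row i of the result is cell order[i] of k_cache; the caller would
    // compute this node and copy the result back over cells [0, n_used).
    return ggml_get_rows(ctx0, k_cache, ids);
}
```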
Fragmented allocation might be complicated to handle, because then the saved session KV cache would need to be copied in multiple chunks (which would also be much slower with a GPU-backed KV cache, but not slower than defragmentation). And `llama_kv_cache_find_slot` would require deeper changes to support fragmented allocation, which doesn't really fit with its current API.
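To make the chunking concern concrete, a toy sketch (all names and the flat cell layout are hypothetical, not llama.cpp internals): each contiguous free run becomes a separate copy, and with a GPU-backed cache each run would be a separate host-to-device transfer:

```c
#include <stdbool.h>
#include <stddef.h>

// Toy model of a KV cache as a flat array of cells.
typedef struct { bool used; float v; } toy_cell;

// Scatter n_cells saved values into whatever free cells exist. Returns the
// number of contiguous runs used, or 0 on failure; each run would be one
// separate memcpy (or host-to-device copy) in a real implementation.
static size_t restore_fragmented(toy_cell * cache, size_t cache_size,
                                 const float * saved, size_t n_cells) {
    size_t copied = 0, runs = 0;
    bool in_run = false;
    for (size_t i = 0; i < cache_size && copied < n_cells; i++) {
        if (!cache[i].used) {
            cache[i].used = true;
            cache[i].v    = saved[copied++];
            if (!in_run) { runs++; in_run = true; }
        } else {
            in_run = false;
        }
    }
    return copied == n_cells ? runs : 0;
}
```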
This issue was closed because it has been inactive for 14 days since being marked as stale.
Feature Description
Currently, when restoring the KV cache, the operation will simply fail if there is not enough sequential space to restore the slot. There needs to be a way to make space, or to change the allocation strategy during restore. (My current strategy is just to use one slot less every time it fails. It seems to settle around 50% slot usage, which is probably not ideal.)
Motivation
Reliability.
Possible Implementation
1) Can we just defragment when this happens? This seems like the easiest solution. Should we expose an endpoint in the server to defragment, or do it whenever the failure happens? What are the performance implications of defragmenting?
2) Or, can we allocate in a fragmented manner? And what are the performance implications in that case?