Originally posted by **julmb** August 15, 2024
Let's say I want to use llama.cpp as a shared library to build a service that other applications can make requests to. When this service gets a request, it feeds it to the model via `llama_decode`. The tokens that make up the request are processed and added to the internal KV cache.
Now, when the next request arrives, I need to decide which prefix of the request is already cached and therefore does not need to be processed again. From what I understand, the KV cache does not store the actual tokens, so I have no way of knowing which part of the cache needs to be cleared and which part of the request tokens still needs to be fed to the model.
As far as I can tell, I have two options:
1. Clear the entire cache and reprocess the entire request. This is of course slow, especially for requests that share a large prefix with the previous one.
2. Keep track of which tokens are currently in the cache myself. This is error prone, since "what I believe is currently in the KV cache" and "what is actually in the KV cache" can easily get out of sync if I am not very careful (especially in the case of exceptions or other interruptions). A rough sketch of this bookkeeping is shown after the list.
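For illustration, a minimal sketch of what option 2 looks like, assuming the service shadows the cached tokens in its own struct (the `cached_seq` name and helpers here are made up) and uses `llama_kv_cache_seq_rm` to drop the part of the sequence past the shared prefix:

```cpp
#include "llama.h"
#include <vector>

struct cached_seq {
    llama_seq_id             seq;
    std::vector<llama_token> tokens; // what the service *believes* is in the KV cache
};

// Length of the longest common prefix of two token sequences.
static size_t common_prefix(const std::vector<llama_token> & a, const std::vector<llama_token> & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        n++;
    }
    return n;
}

// Trim the cache down to the shared prefix and report how many tokens of the
// new request still need to be decoded (starting at position n_keep).
static size_t reuse_prefix(llama_context * ctx, cached_seq & state,
                           const std::vector<llama_token> & request) {
    const size_t n_keep = common_prefix(state.tokens, request);

    // Remove cells [n_keep, end) of this sequence from the KV cache.
    llama_kv_cache_seq_rm(ctx, state.seq, (llama_pos) n_keep, -1);

    // If the subsequent llama_decode fails or the process is interrupted,
    // state.tokens and the real cache contents can silently diverge -- this is
    // exactly the bookkeeping hazard described in option 2.
    state.tokens.assign(request.begin(), request.begin() + n_keep);

    return request.size() - n_keep;
}
```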
It seems like llama.cpp offers a stateful interface for interacting with the model/context, but in places it lacks ways to inspect the state that the model/context is currently in, which makes it awkward to work with.
Would it not make sense for the KV cache structure inside llama.cpp to keep track of which tokens (and which `seq_id`s) are currently in the cache and to make this information available to users of the library? From what I can tell, the actual tokens would take up a vanishingly small amount of memory compared to the tensors the cache already stores.
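For illustration only, the kind of accessor I have in mind might look something like the declarations below. These are entirely hypothetical; nothing like them exists in `llama.h` today, and the names are made up:

```cpp
// HYPOTHETICAL additions to llama.h -- these functions do NOT exist; they only
// illustrate the kind of KV cache introspection being proposed.

// number of tokens currently held in the KV cache for seq_id
int32_t llama_kv_cache_seq_n_tokens (struct llama_context * ctx, llama_seq_id seq_id);

// copy those tokens, in position order, into a caller-provided buffer;
// returns the number of tokens written
int32_t llama_kv_cache_seq_get_tokens(struct llama_context * ctx, llama_seq_id seq_id,
                                      llama_token * tokens, int32_t n_max);
```

With something like this, a caller could compute the reusable prefix directly from what the cache actually contains instead of maintaining shadow state that can drift.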
Disclaimer: I only started using llama.cpp a few weeks ago, so I might be misunderstanding something. But if this is considered a good idea, maybe I should make an issue for it?
Discussed in https://github.com/ggerganov/llama.cpp/discussions/9043