ggerganov / llama.cpp

LLM inference in C/C++
MIT License

llama : store token ids in the KV Cache #9113

Open ggerganov opened 3 months ago

ggerganov commented 3 months ago

Discussed in https://github.com/ggerganov/llama.cpp/discussions/9043

Originally posted by **julmb** August 15, 2024

Let's say I want to use llama.cpp as a shared library to build a service that other applications can make requests to. When this service gets a request, it feeds it to the model via `llama_decode`. The tokens that make up the request are processed and added to the internal KV cache. When the next request arrives, I need to decide which prefix of it is already cached and therefore does not need to be processed again. From what I understand, the KV cache does not store the actual tokens, so I have no way of knowing which part of the cache needs to be cleared and which part of the request tokens needs to be fed to the model. As far as I can tell, I have two options:

1. Clear the entire cache and reprocess the entire request. This is of course slow, especially for requests that share a large prefix.
2. Keep track of which tokens are currently in the cache myself. This is error prone, as "what I believe is currently in the KV cache" and "what is actually in the KV cache" can easily get out of sync if I am not very careful (especially in the case of exceptions or other interruptions); see the sketch after this post.

It seems like llama.cpp offers a stateful interface for interacting with the model/context, but in some places it lacks ways to inspect the state the model/context is currently in, which makes it awkward to work with. Would it not make sense for the KV cache structure inside llama.cpp to keep track of which tokens (and which `seq_id`s) are currently in the cache and to make this information available to users of the library? From what I can tell, the actual tokens would take up a vanishingly small amount of memory compared to the tensors the cache already stores.

Disclaimer: I only started using llama.cpp a few weeks ago, so I might be misunderstanding something. But if this is considered a good idea, maybe I should make an issue for it?
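For concreteness, option 2 (caller-side bookkeeping) might look roughly like the sketch below: the caller records the tokens it has fed into the KV cache for a sequence, computes the longest shared prefix with an incoming request, evicts only the divergent suffix, and decodes only the new tokens. The `cached_seq`, `common_prefix`, and `process_request` names are hypothetical helpers introduced here for illustration; the `llama_*` calls follow the llama.cpp C API at the time of this issue (`llama_kv_cache_seq_rm`, `llama_batch_init`, `llama_batch_free`, `llama_decode`), whose exact signatures may differ between versions. The `tokens` vector is exactly the state the issue proposes the library should track and expose itself.

```cpp
// Illustrative sketch only - assumes the llama.cpp C API circa this issue.
#include <vector>
#include "llama.h"

struct cached_seq {
    llama_seq_id             seq_id;
    std::vector<llama_token> tokens; // what the caller *believes* is in the KV cache
};

// Number of leading tokens shared by the cached sequence and the new request.
static size_t common_prefix(const std::vector<llama_token> & a,
                            const std::vector<llama_token> & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        n++;
    }
    return n;
}

// Process `request` for `cs`, reusing whatever prefix is already cached.
// Returns false if decoding fails - in that case the caller's bookkeeping may
// be out of sync with the real KV cache, which is exactly the fragility the
// issue describes.
static bool process_request(llama_context * ctx, cached_seq & cs,
                            const std::vector<llama_token> & request) {
    const size_t n_keep = common_prefix(cs.tokens, request);

    // evict the cached suffix that no longer matches: positions [n_keep, end)
    llama_kv_cache_seq_rm(ctx, cs.seq_id, (llama_pos) n_keep, -1);
    cs.tokens.resize(n_keep);

    if (n_keep == request.size()) {
        return true; // the whole request was already cached
    }

    // decode only the new suffix as a single batch with explicit positions
    const int32_t n_new = (int32_t) (request.size() - n_keep);
    llama_batch batch = llama_batch_init(n_new, 0, 1);
    for (int32_t i = 0; i < n_new; i++) {
        batch.token   [i]    = request[n_keep + i];
        batch.pos     [i]    = (llama_pos) (n_keep + i);
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = cs.seq_id;
        batch.logits  [i]    = (i == n_new - 1); // logits only for the last token
    }
    batch.n_tokens = n_new;

    const bool ok = llama_decode(ctx, batch) == 0;
    llama_batch_free(batch);

    if (ok) {
        // update the bookkeeping only after a successful decode
        cs.tokens.insert(cs.tokens.end(), request.begin() + n_keep, request.end());
    }
    return ok;
}
```

If the KV cache itself stored the token ids per sequence, this external `tokens` vector and its failure-mode bookkeeping could be replaced by a query against the context, removing the risk of the two views drifting apart.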
shankarg87 commented 2 months ago

Is this already being worked on? Would this be a "good first issue" for a new contributor? If so, I can take a look.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 14 days since being marked as stale.