Is your feature request related to a problem? Please describe.
I have a dataset that I'm using for RAG. The user's question is used to look up the top N most relevant documents, which are then used to build a prompt that looks like:
<system>
You are an assistant for domain <x>, you summarize information and blah blah blah.<eot>
<user>
Document 1: Some information that is relevant
Document 2: Other information
Document 3: Final information
User question: "How do I do <y>?"<eot>
In order to minimize latency, I've developed a "static" disk cache that stores, for every document in my dataset, the state for the system prompt + that document as the first document in context. (An example script for doing this, though an old one, is also in my branch.)
This way, I only need to ingest the remaining documents + user question when doing prompt processing, so I save a lot of time in time-to-first-token for this use case.
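For concreteness, the intended workflow looks roughly like the sketch below (illustrative only: the class name matches my branch, but the import path, the factory method name build_cache, and its signature are placeholders here, not the final API):

```python
from llama_cpp import Llama

# LlamaStaticDiskCache only exists in my branch; the import path is a guess.
from llama_cpp.llama_cache import LlamaStaticDiskCache

SYSTEM = (
    "<system>\n"
    "You are an assistant for domain <x>, you summarize information and blah blah blah.<eot>\n"
    "<user>\n"
)
documents = [
    "Document 1: Some information that is relevant\n",
    "Document 2: Other information\n",
    "Document 3: Final information\n",
]

llm = Llama(model_path="model.gguf", n_ctx=4096, n_batch=512)

# Offline step: one cache entry per document, keyed by the tokens of
# (system prompt + that document as the first document in context).
prefixes = [SYSTEM + doc for doc in documents]
cache = LlamaStaticDiskCache.build_cache(  # hypothetical factory method name
    llm, prefixes, cache_dir="./static_cache"
)

# Inference time: attach the cache; only the tokens after the longest cached
# prefix (the remaining documents + the user question) still need prompt
# processing, which is where the time-to-first-token savings come from.
llm.set_cache(cache)
prompt = prefixes[0] + documents[1] + 'User question: "How do I do <y>?"<eot>'
out = llm(prompt, max_tokens=256)
```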
Describe the solution you'd like
I'd like to upstream the LlamaStaticDiskCache class in my branch. It's very similar to the existing LlamaDiskCache, but:
- The cache is not mutable once built (it does not pop entries in __getitem__)
- It uses a trie for finding the longest matching prefix (if any) in the cache (sketched below)
- It has a convenience factory method for building the cache from a list of prompts
So it's well-suited for use cases where you want to build the cache once (for a given model + context size + batch size) and then reload at inference time based on matching the prefix of the prompt.
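To make the prefix matching concrete, here is a minimal standalone sketch of the trie lookup idea (illustrative only, not the code from my branch; the class and method names are made up):

```python
from typing import Any, Dict, Optional, Sequence, Tuple


class PrefixTrie:
    """Token-level trie: keys are token sequences, values are saved states."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTrie"] = {}
        self.value: Optional[Any] = None  # state for a key ending at this node

    def insert(self, tokens: Sequence[int], value: Any) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTrie())
        node.value = value

    def longest_prefix(self, tokens: Sequence[int]) -> Optional[Tuple[int, Any]]:
        """Return (matched_length, value) for the longest stored key that is a
        full prefix of `tokens`, or None if no stored key matches."""
        node, best = self, None
        for i, tok in enumerate(tokens):
            node = node.children.get(tok)
            if node is None:
                break
            if node.value is not None:
                best = (i + 1, node.value)
        return best
```

In the cache, a lookup like this would back __getitem__: on a hit, the state saved for the matched prefix is reloaded, and only the tokens after it go through prompt processing.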
Complication / Details with this
I've found that when running locally (Mac OS + Metal GPU) and deploying on different hardware (Linux + CPU), I have had to make a minor change to llama.cpp to avoid serializing / deserializing the RNG state.
I.e., skip loading and set seed for reproducibility: https://github.com/tc-wolf/llama.cpp/commit/ea43d922e4169f4fd7622e4a8e6eca92ef921038
I don't think that this will be a factor anymore because https://github.com/ggerganov/llama.cpp/pull/9294 has removed serializing / deserializing the RNG when saving.
Describe alternatives you've considered
- Use lower-level state saving functions (rather than pickling llama.save_state()) so that less is stored on disk than a full model file
- Use a more efficient strategy for saving - right now, if every key shares the same system prompt (for example), that prefix is saved independently for every stored prompt. A lot of space could be saved by deduplicating and saving each shared prefix only once, but it complicates the saving/loading logic.
- Allow partial matches when checking the cache - right now a key has to be a full prefix of the input tokens, but looking for a partial match would allow for more graceful failure (a rough sketch of the matching part follows this list).
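For the partial-match alternative, the matching side could look something like the sketch below (illustrative only; the harder part - truncating a reloaded state down to the shared prefix length before continuing prompt processing - is not shown):

```python
from typing import Optional, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_partial_match(
    keys: Sequence[Sequence[int]], input_tokens: Sequence[int]
) -> Optional[Tuple[int, int]]:
    """Return (key_index, shared_length) for the stored key sharing the longest
    common prefix with input_tokens, or None if nothing overlaps at all."""
    best = None
    for i, key in enumerate(keys):
        shared = common_prefix_len(key, input_tokens)
        if shared and (best is None or shared > best[1]):
            best = (i, shared)
    return best
```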