Is your feature request related to a problem? Please describe.
I have a dataset that I'm using for RAG. The user's question is used to look up the top N most relevant documents, which are then used to build a prompt that looks like:
<system>
You are an assistant for domain <x>, you summarize information and blah blah blah.<eot>
<user>
Document 1: Some information that is relevant
Document 2: Other information
Document 3: Final information
User question: "How do I do <y>?"<eot>
In order to minimize latency, I've developed a "static" disk cache that stores, for every document in my dataset, the state for the system prompt + that document as the first document in context. (An example script for doing this, though an old one, is also in my branch.)
This way, I only need to ingest the remaining documents + user question when doing prompt processing, so I save a lot of time in time-to-first-token for this use case.
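For concreteness, the intended workflow looks roughly like the sketch below (illustrative only: the class name matches my branch, but the import path, the factory method name build_cache, and its signature are placeholders here, not the final API):

```python
from llama_cpp import Llama

# LlamaStaticDiskCache only exists in my branch; the import path is a guess.
from llama_cpp.llama_cache import LlamaStaticDiskCache

SYSTEM = (
    "<system>\n"
    "You are an assistant for domain <x>, you summarize information and blah blah blah.<eot>\n"
    "<user>\n"
)
documents = [
    "Document 1: Some information that is relevant\n",
    "Document 2: Other information\n",
    "Document 3: Final information\n",
]

llm = Llama(model_path="model.gguf", n_ctx=4096, n_batch=512)

# Offline step: one cache entry per document, keyed by the tokens of
# (system prompt + that document as the first document in context).
prefixes = [SYSTEM + doc for doc in documents]
cache = LlamaStaticDiskCache.build_cache(  # hypothetical factory method name
    llm, prefixes, cache_dir="./static_cache"
)

# Inference time: attach the cache; only the tokens after the longest cached
# prefix (the remaining documents + the user question) still need prompt
# processing, which is where the time-to-first-token savings come from.
llm.set_cache(cache)
prompt = prefixes[0] + documents[1] + 'User question: "How do I do <y>?"<eot>'
out = llm(prompt, max_tokens=256)
```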
Describe the solution you'd like
I'd like to upstream the LlamaStaticDiskCache class in my branch. It's very similar to the existing LlamaDiskCache, but:
- The cache is not mutable once built (it does not pop entries in __getitem__)
- It uses a trie for finding the longest matching prefix (if any) in the cache (sketched below)
- It has a convenience factory method for building the cache from a list of prompts
So it's well-suited for use cases where you want to build the cache once (for a given model + context size + batch size) and then reload at inference time based on matching the prefix of the prompt.
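To make the prefix matching concrete, here is a minimal standalone sketch of the trie lookup idea (illustrative only, not the code from my branch; the class and method names are made up):

```python
from typing import Any, Dict, Optional, Sequence, Tuple


class PrefixTrie:
    """Token-level trie: keys are token sequences, values are saved states."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixTrie"] = {}
        self.value: Optional[Any] = None  # state for a key ending at this node

    def insert(self, tokens: Sequence[int], value: Any) -> None:
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixTrie())
        node.value = value

    def longest_prefix(self, tokens: Sequence[int]) -> Optional[Tuple[int, Any]]:
        """Return (matched_length, value) for the longest stored key that is a
        full prefix of `tokens`, or None if no stored key matches."""
        node, best = self, None
        for i, tok in enumerate(tokens):
            node = node.children.get(tok)
            if node is None:
                break
            if node.value is not None:
                best = (i + 1, node.value)
        return best
```

In the cache, a lookup like this would back __getitem__: on a hit, the state saved for the matched prefix is reloaded, and only the tokens after it go through prompt processing.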
Complication / Details with this
I've found that when running locally (Mac OS + Metal GPU) and deploying on different hardware (Linux + CPU), I have had to make a minor change to llama.cpp to avoid serializing / deserializing the RNG state.
I.e., skip loading and set seed for reproducibility: https://github.com/tc-wolf/llama.cpp/commit/ea43d922e4169f4fd7622e4a8e6eca92ef921038
I don't think that this will be a factor anymore because https://github.com/ggerganov/llama.cpp/pull/9294 has removed serializing / deserializing the RNG when saving.
Describe alternatives you've considered
- Use lower-level state saving functions (rather than pickling llama.save_state()) so that less is stored on disk than a full model file
- Use a more efficient strategy for saving - right now, if every key shares the same system prompt (for example), that prefix is saved independently for every stored prompt. A lot of space could be saved by deduplicating and saving each shared prefix only once, but it complicates the saving/loading logic.
- Allow partial matches when checking the cache - right now a key has to be a full prefix of the input tokens, but looking for a partial match would allow for more graceful failure (a rough sketch of the matching part follows this list).
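For the partial-match alternative, the matching side could look something like the sketch below (illustrative only; the harder part - truncating a reloaded state down to the shared prefix length before continuing prompt processing - is not shown):

```python
from typing import Optional, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_partial_match(
    keys: Sequence[Sequence[int]], input_tokens: Sequence[int]
) -> Optional[Tuple[int, int]]:
    """Return (key_index, shared_length) for the stored key sharing the longest
    common prefix with input_tokens, or None if nothing overlaps at all."""
    best = None
    for i, key in enumerate(keys):
        shared = common_prefix_len(key, input_tokens)
        if shared and (best is None or shared > best[1]):
            best = (i, shared)
    return best
```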