abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

LLamaDiskCache: needs a RO / 'static' disk cache for RAG use cases #1737

Open tc-wolf opened 2 weeks ago

tc-wolf commented 2 weeks ago

Is your feature request related to a problem? Please describe.

I have a dataset that I'm using for RAG. The user's question determines the lookup of the top N most relevant documents, which are then used to build a prompt that looks like:

<system>
You are an assistant for domain <x>, you summarize information and blah blah blah.<eot>
<user>
Document 1: Some information that is relevant
Document 2: Other information
Document 3: Final information

User question: "How do I do <y>?"<eot>
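The prompt assembly described above can be sketched as follows. This is purely illustrative: `build_prompt` and the tag strings are placeholders mirroring the example, not part of llama-cpp-python.

```python
# Hypothetical sketch of the RAG prompt layout from the example above.
# The <system>/<user>/<eot> tags are taken verbatim from the example.

SYSTEM = (
    "<system>\n"
    "You are an assistant for domain <x>, you summarize information "
    "and blah blah blah.<eot>\n"
)

def build_prompt(documents, question):
    """Build the prompt: fixed system text, then the retrieved
    documents in rank order, then the user's question."""
    doc_lines = "\n".join(
        f"Document {i}: {text}" for i, text in enumerate(documents, start=1)
    )
    return f'{SYSTEM}<user>\n{doc_lines}\n\nUser question: "{question}"<eot>'

prompt = build_prompt(
    ["Some information that is relevant", "Other information"],
    "How do I do <y>?",
)
```

Because the system text (and, with the static cache, the first document) is identical across requests, everything before the remaining documents is a stable prefix that can be precomputed.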

In order to minimize latency, I've developed a "static" disk cache that, for every document in my dataset, contains the system prompt + that document as the first context document. (An example script for doing this, though an old one, is also in my branch.)

This way, I only need to ingest the remaining documents + the user question during prompt processing, which saves a lot of time-to-first-token for this use case.
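The saving comes from reusing the longest cached token prefix so only the suffix is reprocessed. A minimal, self-contained sketch of that lookup (analogous to, but not the actual, LlamaDiskCache prefix matching):

```python
def common_prefix_len(a, b):
    """Length of the shared leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def find_longest_prefix_key(cache_keys, tokens):
    """Return the cached key sharing the longest prefix with `tokens`,
    or None if no cached key is a full prefix of the prompt."""
    best_key, best_len = None, 0
    for key in cache_keys:
        plen = common_prefix_len(key, tokens)
        # Only a full-key match counts: the saved state must cover an
        # exact prefix of the new prompt to be reusable.
        if plen == len(key) and plen > best_len:
            best_key, best_len = key, plen
    return best_key

# Toy token ids: cached state = system prompt + document 1.
cache = {(1, 2, 3, 4): "saved-llama-state"}
prompt_tokens = (1, 2, 3, 4, 9, 10)  # same prefix + remaining docs/question
hit = find_longest_prefix_key(cache, prompt_tokens)
```

On a hit, only the tokens after the matched prefix (here `9, 10`) need to be evaluated, which is exactly where the time-to-first-token savings come from.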

Describe the solution you'd like

I'd like to upstream the LlamaStaticDiskCache class in my branch. It's very similar to the existing LlamaDiskCache, but read-only / "static": it's well-suited for use cases where you want to build the cache once (for a given model + context size + batch size) and then reload it at inference time based on matching the prefix of the prompt.
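Since a prebuilt cache is only valid for the exact model + context size + batch size it was built with, one way to make a mismatched reload fail fast is to fold those build parameters into the cache key. A sketch under that assumption; `cache_key` and its parameters are illustrative, not llama-cpp-python API:

```python
import hashlib
import json

def cache_key(model_path, n_ctx, n_batch, prefix_tokens):
    """Derive a stable key for a prebuilt cache entry from the build
    parameters plus the token prefix it covers. Illustrative only."""
    payload = json.dumps(
        {"model": model_path, "n_ctx": n_ctx, "n_batch": n_batch,
         "tokens": list(prefix_tokens)},
        sort_keys=True,  # deterministic serialization -> stable hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("model.gguf", 4096, 512, [1, 2, 3])
k2 = cache_key("model.gguf", 4096, 512, [1, 2, 3])  # same params -> same key
k3 = cache_key("model.gguf", 2048, 512, [1, 2, 3])  # different n_ctx -> miss
```

A lookup with a different context size then simply misses rather than loading an incompatible saved state.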

Complications / details

I've found that when running locally (macOS + Metal GPU) and deploying on different hardware (Linux + CPU), I have had to make a minor change to llama.cpp to avoid serializing / deserializing the RNG state.

That is, skip loading the RNG state and set the seed for reproducibility: https://github.com/tc-wolf/llama.cpp/commit/ea43d922e4169f4fd7622e4a8e6eca92ef921038

I don't think this will be a factor anymore, because https://github.com/ggerganov/llama.cpp/pull/9294 removed RNG serialization / deserialization when saving state.

Describe alternatives you've considered