EricLBuehler / mistral.rs


Enabling prefix cache for llama3 gguf #347

Open · joshpopelka20gmail opened this issue 3 months ago

joshpopelka20gmail commented 3 months ago

For my use case, I've been informed that prefix caching may help me reduce inference time (I'm working on an internal web service). Looking through the codebase, I see that there is some code for creating a prefix cache, i.e. prefix_cacher.rs.

I'm really struggling to understand if this cacher is currently being used. It doesn't seem like no_prefix_cache is available in the Python API (though I do see with_prefix_cache_n). Is the prefix cache being used across messages? I don't plan on batching requests since I'd like a real-time service, so for my use case it'll be one message at a time. Since it'll be a service, the prompt will have the same prefix each time.

EricLBuehler commented 3 months ago

I'm really struggling to understand if this cacher is currently being used.

It is being used by default. If you set the prefix cache n to be 0, then it will not be used as no prefixes will be cached.

Since it'll be a service, the prompt will have the same prefix each time.

Currently, we only allow verbatim matches. I'm working on allowing subset matches in #350, which should make the rate of cache hits much higher and improve performance for use cases such as yours.
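
To make the distinction concrete, here is a minimal sketch of the two matching strategies. The types and method names are illustrative assumptions, not the actual prefix_cacher.rs code:

```rust
// Illustrative sketch only: not the actual prefix_cacher.rs implementation.
use std::collections::HashMap;

/// Stand-in for the KV cache built while prefilling a token prefix.
struct CachedKv;

struct PrefixCacher {
    /// Cached KV states, keyed by the exact token sequence they cover.
    cached: HashMap<Vec<u32>, CachedKv>,
}

impl PrefixCacher {
    /// Verbatim matching: a hit only if the prompt's tokens equal a cached key exactly.
    fn lookup_verbatim(&self, tokens: &[u32]) -> Option<&CachedKv> {
        self.cached.get(tokens)
    }

    /// Subset (prefix) matching, the idea behind #350: reuse the cache of the longest
    /// cached sequence that is a prefix of the prompt, and report how many tokens are
    /// already covered so only the remainder needs a prefill pass.
    fn lookup_longest_prefix(&self, tokens: &[u32]) -> Option<(usize, &CachedKv)> {
        self.cached
            .iter()
            .filter(|(key, _)| tokens.starts_with(key.as_slice()))
            .max_by_key(|(key, _)| key.len())
            .map(|(key, kv)| (key.len(), kv))
    }
}
```

Under prefix matching, a request that shares only its system prompt with a previously cached sequence could still skip prefilling those shared tokens.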

joshpopelka20gmail commented 3 months ago

This is excellent news! I'm looking forward to testing the new feature when you are finished. Thank you for taking the time to work on it. I was thinking I might try to write the code, but I couldn't follow the flow of the cacher to know if it was being used.

EricLBuehler commented 3 months ago

Yes, the prefix cache is a bit complicated and is spread out over multiple files. Here's an outline, for your and future reference, of how it works:

joshpopelka20 commented 3 months ago

I've been reading up more on prefix caching and I'm looking for a little clarification to improve my understanding.

How does your method compare with the prefix-aware KV caching used in ChunkAttention (https://arxiv.org/abs/2402.15220)?

I ask mainly because the paper mentions that "to share the key/value tensors in memory, the shared system prompt must appear at the beginning of the sequence." For my use case, while the system prompt will be the same, I'm also using few-shot learning. The examples sent to the model will vary (though the variation is limited). I'm guessing that you'll be using some eviction algorithm (like Least Recently Used), so will the algorithm cache some of the examples in the prompt between requests?

Also, I'm wondering if you'll implement something similar to the "Two-phase Partition" from the paper. It seems like it would offer some optimizations to the process, though I'm not sure how much.

Sorry if these questions sound a little naive; I'm not an AI researcher and am new to working with LLMs.

EricLBuehler commented 3 months ago

Hi @joshpopelka20! Thanks for the questions.

Currently, it is a bit limited, but #350 will make it more powerful by enabling subsequence matching. The paper you linked is similar to what we do (or will do, after #350). That is, we match the prefix of an input sequence and re-use its cache, so a shared system prompt is essentially guaranteed a cache hit. I want to get #350 merged: it's so close, but it doesn't quite work yet for some reason.

We do use an eviction algorithm to do this, so it will keep only some prefixes in the device memory. In fact, #366 will enable you to write out prior prefix caches and use them at a later date.
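
As a rough sketch of the eviction idea only (the names and the on-evict hook below are illustrative, not the actual code), an LRU store capped at n prefixes could look like this, with n = 0 disabling caching as described earlier and evicted entries handed to a hook that a #366-style change could use to persist them to disk:

```rust
// Illustrative sketch only: a capacity-bounded LRU store for prefix caches.
use std::collections::VecDeque;

struct CachedKv; // stand-in for the stored KV tensors

struct PrefixEntry {
    tokens: Vec<u32>,
    kv: CachedKv,
}

struct PrefixStore {
    /// Maximum number of prefixes to keep; 0 disables prefix caching entirely.
    n: usize,
    /// Least recently used at the front, most recently used at the back.
    entries: VecDeque<PrefixEntry>,
}

impl PrefixStore {
    /// Record the prefix of a completed sequence, evicting the least recently
    /// used entries once we exceed the configured limit.
    fn insert(&mut self, entry: PrefixEntry, persist_evicted: impl Fn(PrefixEntry)) {
        if self.n == 0 {
            return; // prefix caching disabled
        }
        self.entries.push_back(entry);
        while self.entries.len() > self.n {
            if let Some(evicted) = self.entries.pop_front() {
                // A #366-style extension could serialize this entry to disk here
                // so it can be reloaded in a later session.
                persist_evicted(evicted);
            }
        }
    }

    /// Mark an entry as recently used by moving it to the back of the queue.
    fn touch(&mut self, index: usize) {
        if let Some(entry) = self.entries.remove(index) {
            self.entries.push_back(entry);
        }
    }
}
```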

polarathene commented 3 months ago

That is, we match the prefix of an input sequence and re-use its cache.

I assume that's different from semantic caching? That's where they note when it's beneficial to have that kind of caching versus not (e.g., when the same query should intentionally produce a different response rather than return a cached one).

https://github.com/EricLBuehler/mistral.rs/pull/350#issuecomment-2144124121 suggests that it might actually be intended to cache the same response? Just perhaps not as capable as delegating the caching to Qdrant, like that linked article describes?

joshpopelka20 commented 3 months ago

Thanks for working on these changes! I'm really looking forward to testing #350.

I have a question about #366. I see the cache is being saved to disk. My concern is that, as I understand it, the prefix cache saves us some latency in moving data from device memory into the tensors (i.e., memory bandwidth), but reading from disk has much higher latency than memory. Am I missing something?

polarathene commented 3 months ago

but reading from disk has a much higher latency than memory bandwidth. Am I missing something?

If disk reads are frequent, that data should be kept in memory by the OS as a buffer/cache, provided the system isn't already under memory pressure.

You should be able to verify that: the initial read will be slower, but any subsequent reads should be quick. If the data being read is frequently updated and not partitioned into smaller segments, then I guess that might not apply 🤷‍♂️
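
A rough way to check that on a given machine (illustrative sketch only; the file path is a placeholder):

```rust
// Rough illustration of the page-cache effect: read the same file twice and
// compare timings. Results depend on the OS and on whether the file was
// already resident in the page cache before the first read.
use std::fs;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let path = "prefix_cache.bin"; // hypothetical on-disk cache file

    let t0 = Instant::now();
    let first = fs::read(path)?; // likely served from disk (cold)
    let cold = t0.elapsed();

    let t1 = Instant::now();
    let second = fs::read(path)?; // likely served from the OS page cache (warm)
    let warm = t1.elapsed();

    assert_eq!(first.len(), second.len());
    println!("cold read: {cold:?}, warm read: {warm:?}");
    Ok(())
}
```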

In that sort of scenario you could possibly use Redis or similar for cache, with disk backing for less frequently accessed items.

EricLBuehler commented 3 months ago

The disk is much slower, but the sequences stored there would be the ones low in the LRU order, so it shouldn't incur a repeated cost. There would be 3 tiers for the prefix caches, ordered by LRU: in device memory, in CPU memory, and on disk (perhaps backed by Redis in the future).
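
As a conceptual sketch only (illustrative names and structure, not the actual design), demotion would cascade the least recently used prefix from device memory to CPU memory to disk:

```rust
// Conceptual sketch of the three-tier idea (device -> CPU -> disk): when a tier
// is full, its least recently used prefix is demoted to the next, slower tier.
use std::collections::VecDeque;

struct Prefix; // stand-in for a cached prefix (tokens + KV state)

struct Tier {
    name: &'static str,
    capacity: usize,
    /// Least recently used at the front, most recently used at the back.
    entries: VecDeque<Prefix>,
}

fn insert_with_demotion(tiers: &mut [Tier], prefix: Prefix) {
    let mut carry = Some(prefix);
    for tier in tiers.iter_mut() {
        let Some(p) = carry.take() else { break };
        tier.entries.push_back(p);
        if tier.entries.len() > tier.capacity {
            // Demote the least recently used entry to the next tier down.
            carry = tier.entries.pop_front();
            println!("demoting one prefix out of the {} tier", tier.name);
        }
    }
    // Anything still carried after the last (disk) tier simply falls out of scope
    // here; a future design could hand it to Redis or another external store.
}

fn main() {
    let mut tiers = vec![
        Tier { name: "device", capacity: 16, entries: VecDeque::new() },
        Tier { name: "cpu", capacity: 64, entries: VecDeque::new() },
        Tier { name: "disk", capacity: 1024, entries: VecDeque::new() },
    ];
    insert_with_demotion(&mut tiers, Prefix);
}
```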

joshpopelka20 commented 3 months ago

Thanks for that explanation. It makes sense to me now. Also, it does seem like a nice use case for Redis. Looking forward to that future change if you choose to go that route.

EricLBuehler commented 3 months ago

Yeah, I'll look into that! I'm still working on #350, and I made some progress so now it is almost working. After I merge that, I'll apply the changes to #366 and work on Redis support.

However, my focus right now is on getting a vision model working (I'm working on Phi 3 Vision in #351), and it is almost ready! Hopefully I can merge that this week.