enricoros / big-AGI

Generative AI suite powered by state-of-the-art models and providing advanced AI/AGI functions. It features AI personas, AGI functions, multi-model chats, text-to-image, voice, response streaming, code highlighting and execution, PDF import, presets for developers, and much more. Deploy on-prem or in the cloud.
https://big-agi.com
MIT License

[Roadmap] Support Anthropic's prompt caching feature #623

Open tfriedel opened 1 month ago

tfriedel commented 1 month ago

Why: By using Anthropic's prompt caching feature, API input costs can be reduced by up to 90% and latency by up to 80%. For an explanation see: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching and https://x.com/alexalbert__/status/1823751966893465630

When you make an API call with these additions, we check if the designated parts of your prompt are already cached from a recent query. If so, we use the cached prompt, speeding up processing time and reducing costs. Speaking of costs, the initial API call is slightly more expensive (to account for storing the prompt in the cache) but all subsequent calls are one-tenth the normal price.
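For reference, a minimal sketch of what such a request might look like against the Anthropic Messages API, with the prompt-caching beta header and a `cache_control` breakpoint on the system prompt. The model id, prompt contents and the plain `fetch` wrapper are placeholders, not big-AGI code:

```ts
// Minimal sketch of an Anthropic Messages API call with prompt caching enabled.
// The beta header and cache_control placement follow the Anthropic docs linked above.
async function cachedCompletion(apiKey: string, bigSystemPrompt: string, userText: string) {
  const res = await fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'x-api-key': apiKey,
      'anthropic-version': '2023-06-01',
      'anthropic-beta': 'prompt-caching-2024-07-31', // opt into the caching beta
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      model: 'claude-3-5-sonnet-20240620',
      max_tokens: 1024,
      // the system prompt becomes an array of blocks so a block can carry cache_control
      system: [
        {
          type: 'text',
          text: bigSystemPrompt, // needs to be large enough (e.g. >= 1024 tokens) to be cached
          cache_control: { type: 'ephemeral' }, // breakpoint: cache everything up to here
        },
      ],
      messages: [{ role: 'user', content: userText }],
    }),
  });
  return res.json();
}
```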

Description: A switch to enable or disable this feature. Because initial costs are higher and the feature is in beta, it may make sense to allow disabling it.

enricoros commented 1 month ago

Thanks @tfriedel, this will be possible to support with Big-AGI 2. It's a good technology, and the savings are very significant. The same prompt prefix needs to be sent twice to benefit from this, but it also means that the last user message of the conversation will be cached for no reason. Do you have ideas for the best heuristics on when to apply, and more importantly when NOT to apply, the caching?

tfriedel commented 1 month ago

So we typically want this enabled if a large file is added, e.g. source code, an article, a book and so on. Also for system prompts that are large. In theory this feature allows you to pack a huge amount of content into your system prompt and thus rival fine-tuning.

Now when should this not apply? When we don't need to ask follow-up questions! Use cases:

Since there are these use cases, having this easily toggleable would be nice. Maybe with states off/auto, where auto uses some heuristics.

I'd say in auto mode we check if the diff to the last cache breakpoint is large. I.e. let's say we have a convo like this:

<breakpoint 0> (here we haven't actually cached anything yet)
system: ...
user: ...
assistant: ...
user: [large file] ...
<breakpoint 1> (because the large file adds many tokens)
user: ...
assistant: ...
user: ...
assistant: ...
user: [another large file] ...
<breakpoint 2>

So we wouldn't automatically move the breakpoint to each new message, because that would be expensive; only if a large amount of tokens is added. What counts as large I don't know yet, this would need to be calculated.

I haven't yet fully grasped how this works in multi-turn conversations. If you have already cached 90% of the conversation and you add a few messages and then want to cache them again, do you then pay for 10% or 100%? If the former, then there would actually be little harm in continuously shifting the breakpoint. I think that's what they do in this example: https://github.com/anthropics/anthropic-cookbook/blob/main/misc/prompt_caching.ipynb

If the latter, then I think it makes sense to update the cache only at large increments.

I'd probably start with something like the proposed idea and then later simulations could be performed on some saved conversations to optimize the heuristic rules.
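As a starting point, a minimal sketch of what such an auto heuristic could look like, assuming a simple message shape, an external token counter, and an arbitrary "large" threshold (none of these are big-AGI internals):

```ts
// Sketch of the "auto" heuristic: only move the cache breakpoint when the
// conversation has grown enough since the previous one.
interface ChatMessage { role: 'system' | 'user' | 'assistant'; text: string; }

const MIN_NEW_TOKENS_FOR_BREAKPOINT = 2048; // tunable; what counts as "large" is still open

function nextBreakpointIndex(
  messages: ChatMessage[],
  lastBreakpointIndex: number,           // -1 if nothing has been cached yet
  countTokens: (text: string) => number, // e.g. a local tokenizer estimate
): number | null {
  // tokens accumulated after the current breakpoint
  const newTokens = messages
    .slice(lastBreakpointIndex + 1)
    .reduce((sum, m) => sum + countTokens(m.text), 0);

  // not enough new content: keep the old breakpoint and avoid the +25% write cost
  if (newTokens < MIN_NEW_TOKENS_FOR_BREAKPOINT)
    return null;

  // enough new content (e.g. a large pasted file): move the breakpoint to the
  // last user message so follow-up turns hit the cache
  for (let i = messages.length - 1; i > lastBreakpointIndex; i--)
    if (messages[i].role === 'user') return i;
  return null;
}
```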

tfriedel commented 1 month ago

I'm just thinking about this:

Need to be strategic bc if you cache tokens that don't get used within 5 minutes, you're paying 25% more for them. 90% cheaper tokens on the cache hits - so make sure you hit enough!

5 minutes is short! Let's say I put a full source repo in the chat and work on the source code in parallel. I may often spend more than 5 minutes before coming back to the chat window. I wonder if we can keep the cache warm by sending some short dummy "keep alive" message every 4 minutes?

enricoros commented 1 month ago

The keep alive is such a good idea :)

enricoros commented 1 month ago

I see what you mean for the complexity. I want to have the perfect automatic policy for the users, so that they just have the same experience but pay less money.

But it's not that easy to get to an optimal policy. For starters, we don't know if the user intends to continue the conversation, although we could try to infer it (or use the length of the user message as a signal).

For system prompts, the chance that they get reused is high, so we can auto-breakpoint those, and tools too. For the chat messages, I like the strategy of 2 breakpoints on the last 2 user messages (so both adding 1 message and regenerating the last will be cheap), but I also can't get over the fact that many times the user will pay +25% more, only to never hit the cache again.
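As a sketch, that placement (system prompt plus the last 2 user messages; tools would take a further breakpoint the same way, and Anthropic allows up to 4 `cache_control` breakpoints per request) might look roughly like this, with simplified block shapes that are not big-AGI's actual types:

```ts
// Sketch of the "system + last 2 user messages" breakpoint placement discussed above.
type Block = { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } };
type Msg = { role: 'user' | 'assistant'; content: Block[] };

function placeBreakpoints(system: Block[], messages: Msg[]): void {
  // 1) system prompt: high reuse probability, always worth a breakpoint
  if (system.length)
    system[system.length - 1].cache_control = { type: 'ephemeral' };

  // 2) last two user messages: both adding one message and regenerating
  //    the last answer become cache hits
  let marked = 0;
  for (let i = messages.length - 1; i >= 0 && marked < 2; i--) {
    const m = messages[i];
    if (m.role === 'user' && m.content.length) {
      m.content[m.content.length - 1].cache_control = { type: 'ephemeral' };
      marked++;
    }
  }
}
```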

enricoros commented 1 month ago

Bare-bones implementation on Big-AGI 2. The Anthropic API will throw if the chat doesn't have at least 1024 tokens in the system prompt - we expect Anthropic to change this soon.

This is working well, and we already collect token counts, so price reporting will come soon. However, price prediction is more complex, as we'd have to handle timers, make assumptions about hitting the breakpoints, etc.
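For the price-reporting part, the prompt-caching beta reports cached tokens separately in the response's usage block, so a rough cost estimate could be derived like the sketch below. The per-MTok rates are illustrative placeholders; the 1.25x write and 0.1x read multipliers match the "+25% / one-tenth" figures quoted earlier in this thread:

```ts
// Sketch of price reporting from the usage fields returned with prompt caching.
interface CacheUsage {
  input_tokens: number;                 // uncached input tokens
  cache_creation_input_tokens: number;  // tokens written to the cache (1.25x input rate)
  cache_read_input_tokens: number;      // tokens served from the cache (0.1x input rate)
  output_tokens: number;
}

function estimateCostUSD(u: CacheUsage, inPerMTok = 3, outPerMTok = 15): number {
  const M = 1_000_000;
  return (
    (u.input_tokens / M) * inPerMTok +
    (u.cache_creation_input_tokens / M) * inPerMTok * 1.25 +
    (u.cache_read_input_tokens / M) * inPerMTok * 0.1 +
    (u.output_tokens / M) * outPerMTok
  );
}
```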

For now there's just a switch on the Anthropic models: 1. turn it on, 2. hit Refresh, and then chat normally.

enricoros commented 1 month ago

Note that given the API is not really malleable, for now we should give the user control over placing a breakpoint on a message (how would sending a message that places the breakpoint work, though? We'd need extra send functionality for that). In the meantime, it's a per-provider (i.e. all Anthropic models) option, but easy to toggle.

DoS007 commented 2 weeks ago

@tfriedel @enricoros Keep-alive implementation proposal:

LastRequest: the request that was used to get the newest AI answer shown in chat. Process/flow: roughly 4min48sec after the latest AI response, send LastRequest again, but with only the last cache mark and with "max_tokens": 1 (a one-token answer which is discarded). This keep-alive extension could be done x times, according to a user setting for Anthropic.

Edit: when the user has already sent a new message, of course no request at 4min48sec for the old one is needed, as the newly sent prompt already keeps the cache alive.
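A rough sketch of that keep-alive flow, assuming a generic sendAnthropicRequest helper and a plain request body (both hypothetical); the real implementation would live in big-AGI's request layer:

```ts
// Sketch of the proposed keep-alive: ~4min48sec after the last real response,
// resend the last request with max_tokens: 1 so the 5-minute cache TTL is refreshed.
// (The proposal also suggests keeping only the last cache breakpoint on the resend.)
const KEEP_ALIVE_MS = (4 * 60 + 48) * 1000;

function scheduleKeepAlive(
  lastRequestBody: object,
  sendAnthropicRequest: (body: object) => Promise<unknown>,
  maxExtensions: number, // user setting: how many times to extend the cache
) {
  let remaining = maxExtensions;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const tick = () => {
    if (remaining-- <= 0) return;
    // same prefix, throwaway 1-token answer: hits the cache and renews its TTL
    sendAnthropicRequest({ ...lastRequestBody, max_tokens: 1 })
      .then(() => { timer = setTimeout(tick, KEEP_ALIVE_MS); })
      .catch(() => { /* stop extending on errors */ });
  };

  timer = setTimeout(tick, KEEP_ALIVE_MS);
  // cancel when the user sends a real message, which refreshes the cache anyway
  return () => timer && clearTimeout(timer);
}
```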

Comment on the existing implementation:

I like the strategy of 2 breakpoints on the last 2 user messages

That's also what Anthropic proposes: second-to-last user message for cache read, last user message for cache write (because everything up to a breakpoint is cached; that's what they mean by prefix).