tfriedel opened 1 month ago
Thanks @tfriedel, this will be possible to support with Big-AGI 2. It's a good technology, and the savings are very significant. A cached prefix needs to be reused at least once for caching to pay off; otherwise the last user message of the conversation gets cached for no reason. Do you have thoughts/ideas on the best heuristics for when to apply, and more importantly when NOT to apply, the caching?
So we typically want this enabled if a large file is added, e.g. source code, an article, a book, and so on. Also for system prompts that are large. In theory this feature lets you pack a huge amount of content into your system prompt and thus rival fine-tuning.
Now, when should this not apply? When we don't need to ask follow-up questions! Use cases:
Since there are these use cases, having this easily toggleable would be nice. Maybe with states off/auto, where auto uses some heuristics?
I'd say in auto mode we check if the diff to the last cache breakpoint is large. I.e. let's say we have a convo like this:

<breakpoint 0> (here we haven't actually cached anything yet)
system: ...
user: ...
assistant: ...
user: [large file] ...
<breakpoint 1> (because the large file adds many tokens)
user: ...
assistant: ...
user: ...
assistant: ...
user: [another large file] ...
<breakpoint 2>
So we wouldn't automatically move the breakpoint to each new message, because that would be expensive. Only if a large amount of tokens is added. What counts as "large" I don't know yet; this would need to be calculated.
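The "auto" heuristic above could be sketched like this. The threshold value is a hypothetical placeholder, not a calculated number; the right value would have to come out of the pricing math.

```python
# Sketch of the proposed "auto" breakpoint heuristic: only move the
# cache breakpoint when the tokens added since the last breakpoint
# exceed a threshold (e.g. a large pasted file), since every move
# triggers a cache write that costs extra.

MIN_NEW_TOKENS = 5_000  # hypothetical threshold, to be tuned


def should_move_breakpoint(tokens_since_breakpoint: int,
                           threshold: int = MIN_NEW_TOKENS) -> bool:
    """Re-cache only if enough new tokens were added since the last breakpoint."""
    return tokens_since_breakpoint >= threshold
```

So a short follow-up question (a few hundred tokens) would not move the breakpoint, but pasting a 20k-token source file would.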
I haven't yet fully grasped how this works in multi-turn conversations. If you have already cached 90% of the conversation and you add a few messages and then want to cache them again, do you then pay for 10% or 100%? If the former, then there would actually be little harm in continuously shifting the breakpoint. I think that's what they do in this example: https://github.com/anthropics/anthropic-cookbook/blob/main/misc/prompt_caching.ipynb
If the latter, then I think it makes sense to update the cache only at large increments.
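To make the trade-off concrete, here is a back-of-envelope cost model using Anthropic's published multipliers (cache writes cost 1.25x the base input price, cache hits cost 0.1x). The base price and token counts are illustration values, not a claim about any specific plan.

```python
# Back-of-envelope cost model for prompt caching.
# Cache write = 1.25x base input price, cache hit = 0.1x (per Anthropic docs).

BASE = 3.00 / 1_000_000  # example base input price in $/token


def cost_without_cache(prefix_tokens: int, turns: int) -> float:
    """Resend the full prefix every turn at the base rate."""
    return prefix_tokens * turns * BASE


def cost_with_cache(prefix_tokens: int, turns: int) -> float:
    """Pay 1.25x once to write the cache, then 0.1x per subsequent hit."""
    return prefix_tokens * (1.25 + 0.1 * (turns - 1)) * BASE


prefix = 50_000  # e.g. a large pasted file
for turns in (1, 2, 5):
    print(f"{turns} turn(s): plain=${cost_without_cache(prefix, turns):.4f} "
          f"cached=${cost_with_cache(prefix, turns):.4f}")
```

With a single turn, caching is a pure 25% loss; from the second turn on, it wins, and the savings grow with every additional hit. This is exactly why the no-follow-up use cases should not cache.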
I'd probably start with something like the proposed idea, and later simulations could be run on saved conversations to optimize the heuristic rules.
I'm just thinking about this:
Need to be strategic bc if you cache tokens that don't get used within 5 minutes, you're paying 25% more for them. 90% cheaper tokens on the cache hits - so make sure you hit enough!
5 minutes is short! Let's say I put a full source repo in the chat and work on the source code in parallel. I may often spend more than 5 minutes before coming back to the chat window. I wonder if we can keep the cache warm by sending some short dummy "keep alive" message every ~4 minutes?
The keep alive is such a good idea :)
I see what you mean for the complexity. I want to have the perfect automatic policy for the users, so that they just have the same experience but pay less money.
But it's not that easy to get to the optimal planning. For starters, we don't know if the user intends to continue the conversation, although we could try to infer it (e.g. from the length of the user message as a signal).
For system prompts, the chance that they get reused is high, so we can auto-breakpoint those, and tools too. For the chat messages, I like the strat of 2 breakpoints on the last 2 user messages (so both adding 1 message, or regenerating the last, will be cheap), but I also can't get over the fact that many times the user will pay 25% more, only to never hit the cache again.
Bare-bones implementation in Big-AGI 2. The Anthropic API will throw if the system prompt doesn't have at least 1024 tokens - we expect Anthropic to change this soon.
This is working well, and we already collect token counts, so price reporting will come soon. However, price prediction is more complex, as we'd have to handle timers, make assumptions about hitting the breakpoints, etc.
For now there's just a switch on the Anthropic models: 1. turn it on, 2. hit Refresh, and then chat normally.
Note that since the API is not really malleable, for now we should give the user control of placing a breakpoint on a message (how would sending a message that places the breakpoint work, though? - we'd need extra send functionality for that). In the meantime, it's a per-provider option (i.e. all Anthropic models), but easy to toggle.
@tfriedel @enricoros Keep-alive implementation proposal:
LastRequest: the request that was used to get the newest AI answer shown in the chat.
Process/flow: roughly 4 min 48 s after the latest AI response, send LastRequest again, but with only the last cache mark and with "max_tokens": 1 (the one-token answer is discarded). This keep-alive extension could be repeated x times, according to a user setting for Anthropic.
Edit: when the user has already sent a new message, of course no request at 4 min 48 s for the old one is needed, as the newly sent prompt already keeps the cache alive.
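The flow above could look roughly like this. This is a sketch of the scheduling logic only; the `send_fn` callback is a placeholder, not big-AGI or Anthropic SDK code, and the extension count is a made-up default.

```python
# Sketch of the proposed keep-alive flow: after each AI response, arm a
# timer that replays the last request with max_tokens=1 just before the
# 5-minute cache TTL expires, and cancel it if the user sends a new
# message first (the new prompt keeps the cache alive by itself).
import threading

KEEP_ALIVE_DELAY_S = 4 * 60 + 48  # fire at 4min48s, before the 5min TTL


class CacheKeepAlive:
    def __init__(self, send_fn, max_extensions: int = 3):
        self.send_fn = send_fn          # called as send_fn(request_dict)
        self.max_extensions = max_extensions
        self._timer = None
        self._count = 0

    def on_response(self, last_request: dict) -> None:
        """Arm (or re-arm) the timer after an AI response."""
        self.cancel()
        self._count = 0
        self._arm(last_request)

    def on_user_message(self) -> None:
        """A new user message keeps the cache warm; no dummy request needed."""
        self.cancel()

    def cancel(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None

    def _arm(self, last_request: dict) -> None:
        self._timer = threading.Timer(
            KEEP_ALIVE_DELAY_S, self._fire, args=(last_request,))
        self._timer.daemon = True
        self._timer.start()

    def _fire(self, last_request: dict) -> None:
        if self._count >= self.max_extensions:
            return  # respect the user-configured extension limit
        self._count += 1
        # Replay with a 1-token budget; the answer is discarded.
        self.send_fn({**last_request, "max_tokens": 1})
        self._arm(last_request)  # extend again, up to the limit
```

Each keep-alive request itself resets the 5-minute TTL, so chaining them extends the cache lifetime in ~5-minute increments.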
Comment on the existing implementation:
I like the strat of 2 breakpoints on the last 2 user messages
That's also what Anthropic proposes: second-last user message for cache read, last user message for cache write (because everything up to a breakpoint is cached; that's what they mean by "prefix").
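In the Messages API, a breakpoint is a `cache_control: {"type": "ephemeral"}` marker on a content block. A minimal sketch of placing the two breakpoints on the last two user messages (payload contents here are placeholders, not big-AGI's actual request builder):

```python
# Attach cache_control markers to the last two user messages of an
# Anthropic Messages API payload: second-last = expected cache read,
# last = cache write for the next turn.

def place_breakpoints(messages: list) -> list:
    """Return a copy of messages with breakpoints on the last 2 user messages."""
    user_idx = [i for i, m in enumerate(messages) if m["role"] == "user"]
    marked = [dict(m) for m in messages]
    for i in user_idx[-2:]:
        content = marked[i]["content"]
        if isinstance(content, str):
            # normalize a plain string to the block form the marker needs
            content = [{"type": "text", "text": content}]
        content = [dict(b) for b in content]
        content[-1] = {**content[-1], "cache_control": {"type": "ephemeral"}}
        marked[i]["content"] = content
    return marked
```

Everything before each marker becomes part of the cacheable prefix, which is why marking only the last two user messages covers both "add one message" and "regenerate the last answer" cheaply.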
Why: By using Anthropic's prompt caching feature, API input costs can be reduced by up to 90% and latency by up to 80%. For an explanation see: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching and https://x.com/alexalbert__/status/1823751966893465630
Description: A switch to enable or disable this feature. Because initial costs are higher and the feature is in beta, it may make sense to allow disabling it.