Doriandarko / claude-engineer

Claude Engineer is an interactive command-line interface (CLI) that leverages the power of Anthropic's Claude-3.5-Sonnet model to assist with software development tasks. This tool combines the capabilities of a large language model with practical file system operations and web search functionality.

Prompt caching with Claude #166

Closed: tuanardouin closed this 3 weeks ago

tuanardouin commented 1 month ago

Anthropic just announced a new feature "Prompt caching". It lowers cost and reduces latency, particularly for large context.

Excerpt from the article:

Prompt caching, which enables developers to cache frequently used context between API calls, is now available on the Anthropic API. With prompt caching, customers can provide Claude with more background knowledge and example outputs—all while reducing costs by up to 90% and latency by up to 85% for long prompts. Prompt caching is available today in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for Claude 3 Opus coming soon.

Apparently, you just have to add "cache_control": {"type": "ephemeral"} to the system prompt to benefit from it. It also requires the anthropic-beta: prompt-caching-2024-07-31 header on the request.

I'm not familiar with the code base, but if someone can give me some indications and point me in the right direction, I could add it.
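For anyone picking this up, here is a rough sketch of what the change might look like, assuming the app builds a single system prompt string and passes it to the Anthropic Python SDK's messages endpoint. The function name, model string, and max_tokens value are illustrative, not the repo's actual code:

```python
def build_cached_request(system_prompt: str, messages: list) -> dict:
    """Return kwargs for client.messages.create() with the system prompt
    marked as cacheable (ephemeral cache, ~5 minute TTL)."""
    return {
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 4096,
        # The system prompt becomes a list of content blocks; cache_control
        # on a block marks everything up to that point as cacheable.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
        # Beta feature: the request must opt in via this header.
        "extra_headers": {"anthropic-beta": "prompt-caching-2024-07-31"},
    }
```

The idea is that the large, stable system prompt (tool definitions, instructions) gets cached, while the per-turn messages stay uncached.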

dhodgejrrr commented 4 weeks ago

https://github.com/anthropics/anthropic-sdk-python/commit/2c2faf4d1543feb9752a5c86bc25ab51325e2197

m-marinucci commented 4 weeks ago

Yes please, I am spending a fortune and this could help us all save a lot of money!

carlanwray commented 4 weeks ago

Don't forget the ~5 minute cache timeout when implementing this. In some cases it may make sense to perform an otherwise meaningless "refresh" API call to keep the cache warm, rather than re-sending the entire context on a later call. Something to consider.
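One way to handle the timeout, as a sketch: track when the last cache-touching call happened and decide whether a cheap refresh request is worth sending shortly before expiry, since each use resets the TTL. The class name and the one-minute margin are hypothetical; the ~5 minute TTL is from Anthropic's docs:

```python
CACHE_TTL_S = 5 * 60     # ~5 minute ephemeral cache lifetime (per Anthropic docs)
REFRESH_MARGIN_S = 60    # hypothetical safety margin before expiry

class CacheKeepAlive:
    """Tracks the last cache-touching API call so a caller can decide
    whether to send a cheap 'refresh' request instead of losing the cache."""

    def __init__(self) -> None:
        self.last_call_at: float | None = None

    def record_call(self, now: float) -> None:
        # Every call that hits the cache resets its TTL.
        self.last_call_at = now

    def should_refresh(self, now: float) -> bool:
        # Refresh only if the cache is close to expiring but not already gone;
        # once it has expired, a refresh call would just pay the write cost anyway.
        if self.last_call_at is None:
            return False
        age = now - self.last_call_at
        return CACHE_TTL_S - REFRESH_MARGIN_S <= age < CACHE_TTL_S
```

Whether the refresh pays off depends on the cache-hit vs. re-write pricing below, so it probably only makes sense for large contexts.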

tuanardouin commented 4 weeks ago

@carlanwray Good point, I didn't see that. So it's becoming a bit more complex than I thought.

What is the cache lifetime?

The cache has a lifetime (TTL) of about 5 minutes. This lifetime is refreshed each time the cached content is used.

I don't understand what happens after the 5 minutes. Will we get an error? Or will the cached price simply no longer apply, because the content needs to be cached again?

How does Prompt Caching affect pricing?

Prompt Caching introduces a new pricing structure where cache writes cost 25% more than base input tokens, while cache hits cost only 10% of the base input token price.

I didn't see that either. So the first API call will be more expensive, and it's the subsequent ones that are cheaper?

I need to run some tests before starting anything.
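Some back-of-the-envelope math on the quoted multipliers. The $3 per million input tokens base price for Claude 3.5 Sonnet is an assumption to double-check against the pricing page; the 50k-token prompt is just an example:

```python
BASE = 3.00                  # USD per million input tokens (assumed Sonnet price)
CACHE_WRITE = BASE * 1.25    # cache write: 25% more than base input tokens
CACHE_HIT = BASE * 0.10      # cache hit: 10% of the base input token price

prompt_mtok = 50_000 / 1_000_000   # e.g. a 50k-token cached system prompt

first_call = prompt_mtok * CACHE_WRITE   # pays the write premium: $0.1875
later_call = prompt_mtok * CACHE_HIT     # within the 5-min TTL:    $0.015
uncached = prompt_mtok * BASE            # no caching at all:       $0.15
```

So yes: the first call costs more than an uncached one, but caching already breaks even by the second call ($0.1875 + $0.015 vs. 2 × $0.15), and every further hit within the TTL widens the gap.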

tuanardouin commented 4 weeks ago

@Doriandarko are you still maintaining this and accepting PRs?

Doriandarko commented 3 weeks ago

Yeah I was working on this!

m-marinucci commented 3 weeks ago

Thanks, Pietro!

Doriandarko commented 3 weeks ago

This is done.