anthropics / anthropic-quickstarts

A collection of projects designed to help developers quickly get started with building deployable applications using the Anthropic API

[computer use] Discussion about token use and potential for reducing unnecessary token consumption #145

Open Quasimondo opened 3 hours ago

Quasimondo commented 3 hours ago

I don't know if there is a better place to have this discussion, so I apologize in advance for abusing this one. After a bit more than a week of intensive use of computer use (which I think is an absolutely fantastic tool), I am hoping to find some ways to put Claude on some form of a token diet: it is insatiable in gobbling up tokens during computer use, and I am surely not the only one who now gets rate limited every day after just a few hours of using it.

My token usage since I started computer use paints a pretty clear picture: [chart: daily token usage since starting computer use]

My first surprise (since obviously I didn't read the fine manual) is that cached prompts seem to count the same way that regular prompts do. So my first question is: what is the benefit of prompt caching? Is it just to save a few KB of data upload? Is it to speed up the response time? What it does not seem to do is reduce one's token use count.
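For reference, here is a minimal sketch (my assumptions: the current Python SDK where prompt caching is generally available, and Claude 3.5 Sonnet) of where the cache counters actually show up in the response. From what I can tell from the pricing page, cache writes are billed at roughly 1.25x the base input rate and cache reads at roughly 0.1x, so the benefit should be cost and latency rather than a lower raw token count:

import anthropic

client = anthropic.Anthropic()

# stand-in for a system prompt comfortably above the 1024-token minimum
long_system = "You are a computer-use agent. " * 300

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": long_system,
            # breakpoint: everything up to and including this block is cached
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Take a screenshot."}],
)

usage = response.usage
# cache_creation_input_tokens: tokens written to the cache on this call
# cache_read_input_tokens: tokens served from the cache on this call
print(usage.input_tokens, usage.cache_creation_input_tokens, usage.cache_read_input_tokens)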

My second question is: does computer use always send the entire (growing) communication thread to the API? Or do earlier user prompts or machine replies that get marked as "ephemeral" actually vanish from the input after a while? Could some intelligent (or manual) culling of older messages within a thread reduce the total token count?

A simple example is something that happens often: Claude tries some bash tool and needs three attempts until it has figured out the right parameter scheme, or it searches for a string in some files and finds it in the third one. My impression is that the failed attempts could easily be removed from the thread without a negative effect on the context.

So, are there any ideas for saving tokens in a typical workflow (apart from removing screenshots)?
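To make the culling idea concrete, here is a rough sketch of what I mean (entirely hypothetical, not part of the quickstart): drop tool-use turns whose results came back as errors, on the theory that failed attempts add little to the context. The obvious catch is that editing earlier turns invalidates the cached prefix from that point on, so culling and caching pull in opposite directions.

from anthropic.types.beta import BetaMessageParam


def cull_failed_tool_turns(messages: list[BetaMessageParam]) -> list[BetaMessageParam]:
    # collect the ids of tool_use blocks whose tool_result was an error
    failed_ids = {
        block["tool_use_id"]
        for message in messages
        if isinstance(message["content"], list)
        for block in message["content"]
        if block.get("type") == "tool_result" and block.get("is_error")
    }
    culled: list[BetaMessageParam] = []
    for message in messages:
        content = message["content"]
        if not isinstance(content, list):
            culled.append(message)
            continue
        # keep every block that was not part of a failed attempt
        kept = [
            block
            for block in content
            if not (
                (block.get("type") == "tool_use" and block.get("id") in failed_ids)
                or (
                    block.get("type") == "tool_result"
                    and block.get("tool_use_id") in failed_ids
                )
            )
        ]
        if kept:
            culled.append({"role": message["role"], "content": kept})
    return culled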

Quasimondo commented 2 hours ago

I guess it helps to read the documentation (https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching). Interesting: [quoted excerpt from the docs about the cache's five-minute lifetime]

Do I interpret this correctly: once one receives a rate-limit timeout for being over the token limit, the next API call actually makes things worse (and more expensive), since by the time the timeout has passed the previously cached tokens have expired and will have to be cached all over again? It also seems that the strategy of only keeping the last N screenshots might interfere with the caching.
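Back-of-envelope, with my assumptions spelled out (a five-minute cache lifetime, cache writes at 1.25x base input and cache reads at 0.1x as listed on the pricing page, and a 50k-token conversation prefix):

prefix_tokens = 50_000

warm_call = 0.10 * prefix_tokens  # prefix still cached: ~5k token-equivalents billed
cold_call = 1.25 * prefix_tokens  # cache expired during the wait: ~62.5k billed

print(cold_call / warm_call)  # a cache miss costs ~12.5x a cache hit here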

Quasimondo commented 1 hour ago
from anthropic.types.beta import BetaCacheControlEphemeralParam, BetaMessageParam


def _inject_prompt_caching(
    messages: list[BetaMessageParam],
):
    """
    Set cache breakpoints for the 3 most recent turns.
    One cache breakpoint is left for the tools/system prompt, to be shared across sessions.
    """

    breakpoints_remaining = 3
    for message in reversed(messages):
        if message["role"] == "user" and isinstance(
            content := message["content"], list
        ):
            if breakpoints_remaining:
                # mark this turn as a cache breakpoint
                breakpoints_remaining -= 1
                content[-1]["cache_control"] = BetaCacheControlEphemeralParam(
                    {"type": "ephemeral"}
                )
            else:
                # strip the stale breakpoint from the turn that just fell out
                # of the window; we'll only ever have one extra turn per loop
                content[-1].pop("cache_control", None)
                break

Maybe I am missing something important, but the current caching strategy seems sub-optimal. First of all, it only caches user messages, and the problem I see here is that a lot of user messages are rather short: often they are just a confirmation or a correction. According to the documentation, the minimum cacheable prompt length is 1024 tokens for Claude 3.5 Sonnet, but the current algorithm does not take the message length into account at all. It blindly caches the last 3 user messages even though several of them might not even be cacheable, while dropping older messages that might be long enough to justify caching (there are only 3 slots, plus the one for the system prompt, which is probably the most useful of all).

Wouldn't it make more sense to cache the 3 longest messages, independent of whether they are user or machine prompts?
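Something like this sketch, which is purely hypothetical and uses a crude chars/4 token estimate since exact counts are not available client-side. (If I read the docs right, the 1024-token minimum actually applies to the whole prefix up to a breakpoint rather than to the single message, so ranking by message length is only a heuristic.)

from anthropic.types.beta import BetaCacheControlEphemeralParam, BetaMessageParam


def _estimate_tokens(message: BetaMessageParam) -> int:
    # very rough client-side estimate: ~4 characters per token
    content = message["content"]
    if isinstance(content, str):
        return len(content) // 4
    return sum(len(str(block)) for block in content) // 4


def inject_prompt_caching_by_length(
    messages: list[BetaMessageParam], breakpoints: int = 3
):
    # rank all turns, user and assistant alike, by estimated length
    ranked = sorted(messages, key=_estimate_tokens, reverse=True)
    longest = {id(m) for m in ranked[:breakpoints]}
    for message in messages:
        content = message["content"]
        if not isinstance(content, list) or not content:
            continue
        if id(message) in longest:
            content[-1]["cache_control"] = BetaCacheControlEphemeralParam(
                {"type": "ephemeral"}
            )
        else:
            # clear breakpoints on everything else so we never exceed the slot budget
            content[-1].pop("cache_control", None)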

Of course, in my ideal scenario Claude would itself determine which of the messages in the current thread are the most valuable ones to cache. But I guess that is wishful thinking.