BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Help]: Cache Hits Always 0 when using the litellm proxy #6229

Open databill86 opened 1 month ago

databill86 commented 1 month ago

What happened?

When using the LiteLLM OpenAI proxy, I've noticed that the caching functionality is not working as expected. Specifically:

  1. The cache_hit value is always 0, even when cached tokens are being used in the API response.

Code to Reproduce

Here's a minimal example that demonstrates the issue:


import openai

# OpenAI client pointed at the local LiteLLM proxy
client = openai.OpenAI(
    api_key="sk-kqzUMqrHbxg",
    base_url="http://localhost:4004",
)

messages = [
    {
        "role": "system",
        "content": "System prompt, the same, does not change",
    },
    {
        "role": "user",
        "content": "A long user prompt, more than 1024 tokens...",
    },
]

# "openai" is the model alias configured on the proxy
response = client.chat.completions.create(
    model="openai",
    messages=messages,
    response_format={"type": "json_object"},
)

# Read OpenAI's prompt-caching fields from the usage block
print(f"Usage: {response.usage}")
print(f"Cache hit: {response.usage.prompt_tokens_details.cached_tokens != 0}")
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")

The cached tokens count in the LiteLLM UI is also 0 and does not change, even after multiple requests that hit the cache.

litellm version: image: ghcr.io/berriai/litellm:main-v1.49.3

Relevant log output

Usage: CompletionUsage(completion_tokens=594, prompt_tokens=1253, total_tokens=1847, completion_tokens_details=CompletionTokensDetails(audio_tokens=None, reasoning_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=None, cached_tokens=1024))
Cache hit: True
Cached tokens: 1024


krrishdholakia commented 1 month ago

Oh! This isn't necessarily a bug, but our caching logic != OpenAI's prompt caching.

Your cached token field

cached_tokens=1024

is OpenAI's response, not from the proxy @databill86
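To illustrate the distinction, here is a minimal sketch of LiteLLM's own response caching, using the Python SDK with its built-in in-memory cache (the model name and prompt are placeholders, and an OPENAI_API_KEY is assumed). When this cache is enabled and an identical request repeats, LiteLLM returns the stored response itself; that is separate from OpenAI reporting cached_tokens in the usage block, which the proxy simply passes through.

import litellm
from litellm.caching import Cache

# Enable LiteLLM's own response cache (default is in-memory; Redis etc. are also supported)
litellm.cache = Cache()

# Placeholder prompt and model; assumes OPENAI_API_KEY is set in the environment
messages = [{"role": "user", "content": "Same prompt, sent twice"}]

first = litellm.completion(model="gpt-4o-mini", messages=messages, caching=True)
second = litellm.completion(model="gpt-4o-mini", messages=messages, caching=True)

# On a cache hit, the second call is served from LiteLLM's cache instead of being
# generated again; OpenAI's usage.prompt_tokens_details.cached_tokens is a different,
# provider-side signal and is unaffected by this cache.
print(first.id, second.id)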

databill86 commented 1 month ago

Oh, I see! That clears up some of the confusion.

However, I was specifically referring to OpenAI's prompt caching behavior as outlined in their documentation. Do you plan on supporting something more aligned with their caching mechanism in future releases?