BerriAI / litellm

Python SDK, Proxy Server to call 100+ LLM APIs using the OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: LiteLLM SDK - Add support for Google AI Studio context caching #5213

Closed: ishaan-jaff closed this 2 weeks ago

ishaan-jaff commented 4 weeks ago

The Feature

-

Motivation, pitch

-

Twitter / LinkedIn details

No response

ishaan-jaff commented 4 weeks ago

Aiming to have this be compatible with Anthropic prompt caching
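
A minimal sketch of what that could look like from the SDK side, assuming the Anthropic-style cache_control marker on content blocks is reused for gemini/* models (that interface is an assumption here, not settled):

import litellm

# Hypothetical call shape: reuse Anthropic's `cache_control` content-block
# marker for a Gemini model. Whether gemini/* will accept this exact shape
# is an assumption for illustration only.
response = litellm.completion(
    model="gemini/gemini-1.5-flash-001",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<large transcript to cache>",
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "Please summarize this transcript"},
            ],
        }
    ],
)
print(response.choices[0].message.content)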

krrishdholakia commented 2 weeks ago

Looks like you can specify the ID of the cached object. We could have this follow something like our _get_cache_key logic: generate a unique hash for the cached object -> store it -> check if the cached object exists.
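
Rough sketch of that flow (helper names here are illustrative, not the actual _get_cache_key implementation): hash the cacheable content deterministically, and keep a local map from that hash to the cachedContents name Google returns.

import hashlib
import json

# Local registry: content hash -> Google cachedContents name
# (e.g. "cachedContents/4d2kd477o3pg"). Illustrative only.
_gemini_cache_registry: dict = {}

def gemini_cache_key(model: str, cached_contents: list) -> str:
    # Stable hash over the model + the content marked for caching,
    # similar in spirit to _get_cache_key.
    payload = json.dumps({"model": model, "contents": cached_contents}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def lookup_or_register(cache_key: str, create_fn) -> str:
    # Reuse the cachedContents object if we've already created one for this
    # hash; otherwise create it on Google's side and remember the name.
    if cache_key not in _gemini_cache_registry:
        _gemini_cache_registry[cache_key] = create_fn()
    return _gemini_cache_registry[cache_key]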

krrishdholakia commented 2 weeks ago

Looks like there might be size requirements on what gets cached

[Screenshot: 2024-08-26 at 3:18:26 PM]
krrishdholakia commented 2 weeks ago

Unlike Anthropic, there is a minimum input token count for what can be cached -

[Screenshot: 2024-08-26 at 3:19:10 PM]
ishaan-jaff commented 2 weeks ago

> Unlike Anthropic, there is a minimum input token count for what can be cached -

Anthropic also has a min input token requirement for caching btw
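
So either way we need a guard before creating a cache entry. A sketch, with the per-provider minimums hardcoded only for illustration (the real values are assumptions pulled from the providers' docs and should live in config):

# Illustrative minimum-token guard; thresholds are assumptions and
# should be read from provider docs / config rather than hardcoded.
MIN_CACHE_TOKENS = {
    "gemini": 32768,     # Gemini context caching minimum input tokens
    "anthropic": 1024,   # Anthropic prompt caching minimum (model dependent)
}

def should_create_cache(provider: str, prompt_token_count: int) -> bool:
    minimum = MIN_CACHE_TOKENS.get(provider)
    return minimum is not None and prompt_token_count >= minimum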

krrishdholakia commented 2 weeks ago

got it

krrishdholakia commented 2 weeks ago

(base) krrishdholakia@Krrishs-MacBook-Air temp_py_folder % curl -X POST "https://generativelanguage.googleapis.com/v1beta/cachedContents?key=$GEMINI_API_KEY" \
  -H 'Content-Type: application/json' \
  -d @request.json \
  > cache.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1104k    0   307  100 1104k    179   646k  0:00:01  0:00:01 --:--:--  648k
(base) krrishdholakia@Krrishs-MacBook-Air temp_py_folder % cat cache.json
{
  "name": "cachedContents/4d2kd477o3pg",
  "model": "models/gemini-1.5-flash-001",
  "createTime": "2024-08-26T22:31:16.147190Z",
  "updateTime": "2024-08-26T22:31:16.147190Z",
  "expireTime": "2024-08-26T22:36:15.548934784Z",
  "displayName": "",
  "usageMetadata": {
    "totalTokenCount": 323383
  }
}
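
For completeness, the @request.json body isn't shown above. A sketch of generating it, with field names (model / contents / ttl) taken from the public cachedContents docs and the transcript text as a placeholder:

import json

# Sketch of the @request.json payload referenced above; field names follow
# the public Gemini cachedContents docs, content is a placeholder.
request_body = {
    "model": "models/gemini-1.5-flash-001",
    "contents": [
        {
            "role": "user",
            "parts": [{"text": "<large transcript to cache>"}],
        }
    ],
    "ttl": "300s",  # lines up with the ~5 minute createTime -> expireTime gap above
}

with open("request.json", "w") as f:
    json.dump(request_body, f)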

krrishdholakia commented 2 weeks ago

Looks like the cache key is returned as part of the response object.

krrishdholakia commented 2 weeks ago

When you do a curl -X GET on the cache name, you just get back the cache response object:

curl "https://generativelanguage.googleapis.com/v1beta/cachedContents/4d2kd477o3pg?key=$GEMINI_API_KEY"

{
  "name": "cachedContents/4d2kd477o3pg",
  "model": "models/gemini-1.5-flash-001",
  "createTime": "2024-08-26T22:31:16.147190Z",
  "updateTime": "2024-08-26T22:31:16.147190Z",
  "expireTime": "2024-08-26T22:36:15.548934784Z",
  "displayName": "",
  "usageMetadata": {
    "totalTokenCount": 323383
  }
}


This is probably a good way to check if a cached key exists on Google's side; if not -> create it -> run the request.
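
A sketch of that check-then-create flow against the REST endpoints shown above (requests-based, helper names illustrative):

import os
from typing import Optional

import requests

BASE_URL = "https://generativelanguage.googleapis.com/v1beta"
API_KEY = os.environ["GEMINI_API_KEY"]

def get_or_create_cached_content(cache_name: Optional[str], create_body: dict) -> str:
    # If we have a stored cachedContents name, check it still exists on
    # Google's side (it may have expired); otherwise create it.
    if cache_name:
        resp = requests.get(f"{BASE_URL}/{cache_name}", params={"key": API_KEY})
        if resp.status_code == 200:
            return resp.json()["name"]  # e.g. "cachedContents/4d2kd477o3pg"
    resp = requests.post(f"{BASE_URL}/cachedContents", params={"key": API_KEY}, json=create_body)
    resp.raise_for_status()
    return resp.json()["name"]

def generate_with_cache(cache_name: str, prompt: str) -> dict:
    # Run the actual request, referencing the cached content by name.
    body = {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "cachedContent": cache_name,
    }
    resp = requests.post(
        f"{BASE_URL}/models/gemini-1.5-flash-001:generateContent",
        params={"key": API_KEY},
        json=body,
    )
    resp.raise_for_status()
    return resp.json()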

krrishdholakia commented 2 weeks ago

Looks like Gemini only allows 1 cached content per request. So we'll probably need to add a check on the input messages for multiple cached messages? (see the sketch after the curl example below)

curl -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-001:generateContent?key=$GOOGLE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
      "contents": [
        {
          "parts":[{
            "text": "Please summarize this transcript"
          }],
          "role": "user"
        }
      ],
      "cachedContent": "'$CACHE_NAME'"
    }'
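
A sketch of that input-message check, assuming the cache marker ends up at the message level (names illustrative):

def extract_single_cached_message(messages: list) -> tuple:
    # Split messages into (regular messages, the single message marked for
    # caching). Errors if more than one is marked, since generateContent
    # accepts only one `cachedContent` reference per request.
    cached = [m for m in messages if m.get("cache_control")]
    if len(cached) > 1:
        raise ValueError(
            "Gemini context caching supports one cached content per request; "
            f"got {len(cached)} messages marked with cache_control."
        )
    regular = [m for m in messages if not m.get("cache_control")]
    return regular, (cached[0] if cached else None)
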
krrishdholakia commented 2 weeks ago

I guess for v0, just support 1 message to be cached.

Future improvement: support a block of contiguous messages to be cached. (Vertex allows passing this.)
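
Rough sketch of that future direction, grouping a leading contiguous run of marked messages into one cacheable block (purely illustrative):

def split_contiguous_cached_prefix(messages: list) -> tuple:
    # Take the leading contiguous run of messages marked for caching as one
    # block, and return (cacheable_prefix, remaining_messages).
    prefix = []
    for m in messages:
        if not m.get("cache_control"):
            break
        prefix.append(m)
    return prefix, messages[len(prefix):]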