BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: Prompt caching friendly routing strategy #6784

Open jeromeroussin opened 5 days ago

jeromeroussin commented 5 days ago

The Feature

Prompt caching is harder to trigger when litellm load balances across several deployments (using Azure as an example). If the litellm gateway is configured with, say, 3 deployments for a specific model, it may take 3 or more calls before prompt caching kicks in and the cost savings and lower latency are realized. The more deployments there are for a single model, the more calls it takes to "warm up" the prompt cache on each deployment.

I am suggesting the following prompt-caching-friendly routing strategy: whenever a prompt of over 1024 tokens is detected, litellm would cache the beginning of the prompt along with the model-id it landed on. On subsequent calls whose first 1024 tokens match, litellm would route the request to the same cached model-id. The cache entries would only need to live for as long as the prompt caching TTL of the LLM providers themselves (which ranges from 5 minutes to one hour).
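
A minimal sketch of what this could look like, assuming an in-process dict instead of litellm's existing cache, a rough 4-chars-per-token heuristic instead of a real tokenizer, and random choice standing in for the configured load-balancing strategy (all names here are hypothetical, not litellm's actual router API):

```python
import hashlib
import random
import time

PREFIX_TTL_SECONDS = 5 * 60        # match the provider's prompt-cache TTL (5 min to 1 hour)
PREFIX_TOKEN_THRESHOLD = 1024      # providers generally only cache prompts above ~1024 tokens

# prefix_hash -> (deployment_id, expires_at)
_prefix_to_deployment: dict[str, tuple[str, float]] = {}


def _prompt_prefix_hash(messages: list[dict], approx_tokens: int = PREFIX_TOKEN_THRESHOLD) -> str:
    """Hash roughly the first `approx_tokens` tokens of the prompt.
    Uses ~4 characters per token as a heuristic instead of a real tokenizer."""
    text = "".join(str(m.get("content", "")) for m in messages)
    return hashlib.sha256(text[: approx_tokens * 4].encode()).hexdigest()


def pick_deployment(messages: list[dict], healthy_deployments: list[str]) -> str:
    """Return the deployment to route to, preferring prefix affinity."""
    key = _prompt_prefix_hash(messages)
    now = time.time()

    cached = _prefix_to_deployment.get(key)
    if cached is not None:
        deployment_id, expires_at = cached
        if expires_at > now and deployment_id in healthy_deployments:
            # Same prefix seen recently: reuse the deployment whose cache is warm.
            return deployment_id

    # Cache miss or expired entry: fall back to the normal strategy
    # and remember the choice so the next identical prefix lands on it too.
    deployment_id = random.choice(healthy_deployments)
    _prefix_to_deployment[key] = (deployment_id, now + PREFIX_TTL_SECONDS)
    return deployment_id
```

In a real deployment the prefix-to-deployment mapping would presumably live in the gateway's shared cache (e.g. Redis) so that multiple litellm instances route the same prefix consistently.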

Motivation, pitch

Lower costs and lower latencies from prompt caching that kicks in immediately, without sacrificing load balancing.

Twitter / LinkedIn details

https://www.linkedin.com/in/jeromeroussin/

krrishdholakia commented 2 days ago

Need a way to: