BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Bug]: TTL not being set for cached batch requests #6010

Open nhs-work opened 1 day ago

nhs-work commented 1 day ago

What happened?

LiteLLM versions tested: main-v1.40.9-stable, main-v1.44.22-stable, main-v1.48.7-stable

Set up:

  litellm_settings:
    cache: true
    cache_params:
      type: redis
      ttl: 600
      default_in_memory_ttl: 600
      default_in_redis_ttl: 600
    json_logs: true
    set_verbose: true
  router_settings:
    routing_strategy: simple-shuffle
    enable_pre_call_checks: true
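With these settings, every Redis cache write should end up with a TTL: a per-request `ttl` if one is passed, otherwise the configured defaults. A minimal sketch of that expected resolution logic (illustrative only; the function name and structure are my assumptions, not LiteLLM's actual code):

```python
from typing import Optional

def resolve_redis_ttl(request_ttl: Optional[int], cache_params: dict) -> Optional[int]:
    """Pick the TTL for a Redis cache write: a per-request value wins,
    otherwise fall back to the configured defaults."""
    if request_ttl is not None:
        return request_ttl
    # default_in_redis_ttl is more specific than the generic ttl
    for key in ("default_in_redis_ttl", "ttl"):
        if cache_params.get(key) is not None:
            return int(cache_params[key])
    return None  # nothing configured -> key never expires

cache_params = {"type": "redis", "ttl": 600, "default_in_redis_ttl": 600}
print(resolve_redis_ttl(None, cache_params))  # 600: falls back to config
print(resolve_redis_ttl(60, cache_params))    # 60: per-request override
```

Under the config above, no write should ever fall through to `None`, which is why the `ttl=None` seen later in the logs looks like a bug.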

What is observed?

Calling LiteLLM via the following curl command correctly sets the TTL to 600, as configured above:

curl --location 'https://litellm.com/openai/deployments/text-embedding-3-small/embeddings' \
    -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" \
    --header 'Content-Type: application/json' \
    -H "Authorization: Bearer sk-xxx" \
    --data '{
    "model": "text-embedding-3-small",
    "input": "embed this text"
}'

However, the following curl command (note that input is now an array) leaves the key with a TTL of -1, i.e. no expiry:

curl --location 'https://litellm.com/openai/deployments/text-embedding-3-small/embeddings' \
    -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0" \
    --header 'Content-Type: application/json' \
    -H "Authorization: Bearer sk-xxx" \
    --data '{
    "model": "text-embedding-3-small",
    "input": ["embed this text"]
}'

Note that the issue occurs for all valid non-string values of input, which per the OpenAI embeddings API are: List[str] | List[int] | List[List[int]].
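The boundary between the working and broken cases is exactly str vs. the three list shapes. A small helper (the name is mine, purely for illustration) makes the distinction concrete:

```python
def classify_embedding_input(x):
    """Classify an OpenAI embeddings `input` value into one of the
    four accepted shapes: str, List[str], List[int], List[List[int]]."""
    if isinstance(x, str):
        return "str"  # the only shape for which the TTL is applied correctly
    if isinstance(x, list) and x:
        if all(isinstance(i, str) for i in x):
            return "List[str]"
        if all(isinstance(i, int) for i in x):
            return "List[int]"
        if all(isinstance(i, list) and all(isinstance(j, int) for j in i) for i in x):
            return "List[List[int]]"
    raise TypeError("not a valid embeddings input")

print(classify_embedding_input("embed this text"))    # str  -> TTL applied
print(classify_embedding_input(["embed this text"]))  # List[str] -> ttl=-1 observed
```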

Why is this an issue?

Users calling LiteLLM from Langchain via OpenAIEmbeddings would send an array of tokens (int), ref. This fills the cache with key-value pairs that never expire, eventually causing the cache to run out of space.

I have also noticed that when Redis runs out of space in this manner, the LiteLLM pods tend to get stuck and restart whenever multiple calls are made.
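The failure mode can be modeled with a toy TTL store (fake clock, not real Redis): keys written with a TTL age out, while keys written with ttl=None accumulate forever, which is what eventually exhausts the cache:

```python
class ToyTtlStore:
    """Toy model of a TTL cache: set() with ttl=None never expires,
    which is how the buggy batch path fills Redis up over time."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None, now=0.0):
        expires_at = None if ttl is None else now + ttl
        self._data[key] = (value, expires_at)

    def live_keys(self, now):
        return [k for k, (_, exp) in self._data.items()
                if exp is None or exp > now]

store = ToyTtlStore()
store.set("string-input", "resp", ttl=600, now=0)  # healthy single-input path
store.set("list-input", "resp", ttl=None, now=0)   # buggy batch path
print(store.live_keys(now=10_000))  # only the ttl=None key survives
```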

Relevant log output

Sanitized and summarized logs from main-v1.44.22-stable:

Request to litellm:

litellm.aembedding(api_key='xxx', api_base='https://xxx.openai.azure.com/', api_version='2024-06-01', model='azure/text-embedding-3-small', input=["input"], caching=True, client=<openai.lib.azure.AsyncAzureOpenAI object at 0x7f78a7654ed0>, encoding_format='base64', proxy_server_request={'url': 'http://litellm.monitoring:4000/embeddings', 'method': 'POST', 'headers': {'host': 'litellm.monitoring:4000', 'accept-encoding': 'gzip, deflate', 'connection': 'keep-alive', 'accept': 'application/json', 'content-type': 'application/json', 'user-agent': 'OpenAI/Python 1.3.9', 'x-stainless-lang': 'python', 'x-stainless-package-version': '1.3.9', 'x-stainless-os': 'Linux', 'x-stainless-arch': 'x64', 'x-stainless-runtime': 'CPython', 'x-stainless-runtime-version': '3.10.12', 'authorization': '', 'x-stainless-async': 'false', 'content-length': '175'}, 'body': {'input': [[791, 432, 1929, 70910, 374, 459, 10309, 5507, 430, 8720, 288, 701, 11164, 596, 13708, 13]], 'model': 'text-embedding-3-small', 'encoding_format': 'base64'}}, metadata={'user_api_key': '', 'user_api_key_alias': '','user_api_end_user_max_budget': None, 'litellm_api_version': '1.44.22', 'global_max_parallel_requests': None, 'user_api_key_user_id': 'admin', 'user_api_key_org_id': None, 'user_api_key_team_id': '', 'user_api_key_team_alias': '', 'user_api_key_team_max_budget': None, 'user_api_key_team_spend': 53.1909487550011, 'user_api_key_spend': 0.4650396200000007, 'user_api_key_max_budget': None, 'user_api_key_metadata': {}, 'headers': {'host': 'litellm.monitoring:4000', 'accept-encoding': 'gzip, deflate', 'connection': 'keep-alive', 'accept': 'application/json', 'content-type': 'application/json', 'user-agent': 'OpenAI/Python 1.3.9', 'x-stainless-lang': 'python', 'x-stainless-package-version': '1.3.9', 'x-stainless-os': 'Linux', 'x-stainless-arch': 'x64', 'x-stainless-runtime': 'CPython', 'x-stainless-runtime-version': '3.10.12', 'x-stainless-async': 'false', 'content-length': '175'}, 'endpoint': 
'http://litellm.monitoring:4000/embeddings', 'litellm_parent_otel_span': None, 'requester_ip_address': '', 'model_group': 'text-embedding-3-small', 'deployment': 'azure/text-embedding-3-small', 'model_info': {'id': '', 'db_model': False, 'base_model': 'azure/text-embedding-3-small', 'max_tokens': 8191},'api_base': 'https://xxx.openai.azure.com/', 'caching_groups': None}, model_info={'id': 'xxx', 'db_model': False, 'base_model': 'azure/text-embedding-3-small', 'max_tokens': 8191}, timeout=None, max_retries=0) # note that no ttl values are included here

Initialized litellm callbacks, Async Success Callbacks: ['cache', <bound method Router.deployment_callback_on_success of <litellm.router.Router object at 0x7f78a8683850>>, <function _PROXY_track_cost_callback at 0x7f78a9f11c60>, <litellm.proxy.hooks.parallel_request_limiter._PROXY_MaxParallelRequestsHandler object at 0x7f78a9f269d0>, <litellm.proxy.hooks.max_budget_limiter._PROXY_MaxBudgetLimiter object at 0x7f78a9f26a10>, <litellm.proxy.hooks.cache_control_check._PROXY_CacheControlCheck object at 0x7f78aa0cad10>, <litellm._service_logger.ServiceLogging object at 0x7f78a7ba6dd0>, <litellm.integrations.prometheus.PrometheusLogger object at 0x7f78a784ff90>, <bound method SlackAlerting.response_taking_too_long_callback of <litellm.integrations.slack_alerting.SlackAlerting object at 0x7f78aa1cf610>>]
ASYNC kwargs[caching]: True; litellm.cache: <litellm.caching.Cache object at 0x7f78a9ff8310>; kwargs.get('cache'): None
INSIDE CHECKING CACHE
Checking Cache

Getting Cache key. Kwargs: {} # note that no ttl values are included here

Created cache key: model: text-embedding-3-smallinput: inputencoding_format: base64
Hashed cache key (SHA-256): 98f1de774cc1627f2d926d6847cad7f0446cb2ba7205edf9585568e2abf73f98
Get Async Redis Cache: key: 98f1de774cc1627f2d926d6847cad7f0446cb2ba7205edf9585568e2abf73f98
Got Async Redis Cache: key: 98f1de774cc1627f2d926d6847cad7f0446cb2ba7205edf9585568e2abf73f98, cached_response None

{'model': 'text-embedding-3-small', 'messages': [{'role': 'user', 'content': ""}], 'optional_params': {}
RAW RESPONSE:
{"data": [{"embedding": "", "index": 0, "object": "embedding"}], "model": "text-embedding-3-small", "object": "list", "usage": {"prompt_tokens": 16, "total_tokens": 16}}

Looking up model=azure/text-embedding-3-small in model_cost_map
Success: model=azure/text-embedding-3-small in model_cost_map
prompt_tokens=16; completion_tokens=0
Returned custom cost for model=azure/text-embedding-3-small - prompt_tokens_cost_usd_dollar: 3.2e-07, completion_tokens_cost_usd_dollar: 0.0
Async Wrapper: Completed Call, calling async_success_handler: <bound method Logging.async_success_handler of <litellm.litellm_core_utils.litellm_logging.Logging object at 0x7f78a77bd810>>
Logging Details LiteLLM-Success Call: Cache_hit=None
Looking up model=azure/text-embedding-3-small in model_cost_map
Success: model=azure/text-embedding-3-small in model_cost_map
prompt_tokens=16; completion_tokens=0
Returned custom cost for model=azure/text-embedding-3-small - prompt_tokens_cost_usd_dollar: 3.2e-07, completion_tokens_cost_usd_dollar: 0.0
{"message": "litellm.aembedding(model=azure/text-embedding-3-small)\u001b[32m 200 OK\u001b[0m", "level": "INFO", "timestamp": "2024-10-02T06:22:00.316960"}

{"message": "Async Response: EmbeddingResponse(model='text-embedding-3-small', data=[{'embedding': 'xxx', 'index': 0, 'object': 'embedding'}], object='list', usage=Usage(completion_tokens=0, prompt_tokens=16, total_tokens=16))", "level": "DEBUG", "timestamp": "2024-10-02T06:22:00.317126"}
Getting Cache key. Kwargs: {'model': 'text-embedding-3-small', 'messages': [{'role': 'user', 'content': "input"}], 'optional_params': {}

Created cache key: model: text-embedding-3-smallinput: inputencoding_format: base64
Hashed cache key (SHA-256): 98f1de774cc1627f2d926d6847cad7f0446cb2ba7205edf9585568e2abf73f98
Set Async Redis Cache: key list: [('98f1de774cc1627f2d926d6847cad7f0446cb2ba7205edf9585568e2abf73f98', {'timestamp': 1727850120.3173609, 'response': {'embedding': '', 'index': 0, 'object': 'embedding'}})]
ttl=None, redis_version=7.1.0
Set ASYNC Redis Cache PIPELINE: key: 98f1de774cc1627f2d926d6847cad7f0446cb2ba7205edf9585568e2abf73f98
Value {'timestamp': 1727850120.3173609, 'response': {'embedding': '', 'index': 0, 'object': 'embedding'}}
ttl=None
Logging Details LiteLLM-Async Success Call, cache_hit=None
Looking up model=azure/text-embedding-3-small in model_cost_map
Success: model=azure/text-embedding-3-small in model_cost_map
prompt_tokens=16; completion_tokens=0
Returned custom cost for model=azure/text-embedding-3-small - prompt_tokens_cost_usd_dollar: 3.2e-07, completion_tokens_cost_usd_dollar: 0.0


nhs-work commented 1 day ago

Based on preliminary investigation and the logs, the code flow appears to be as follows (feel free to correct me if I'm wrong):

  1. A call is made to async def wrapper_async in litellm/utils.py's client, which calls https://github.com/BerriAI/litellm/blob/e19bb55e3b4c6a858b6e364302ebbf6633a51de5/litellm/utils.py#L1068 (hence the Request to litellm: log line)
  2. This calls https://github.com/BerriAI/litellm/blob/e19bb55e3b4c6a858b6e364302ebbf6633a51de5/litellm/utils.py#L1459-L1463 with the initial args and kwargs printed above (which do not include ttl)
  3. This calls https://github.com/BerriAI/litellm/blob/e19bb55e3b4c6a858b6e364302ebbf6633a51de5/litellm/caching.py#L2619
  4. This prints out Set Async Redis Cache: key list: https://github.com/BerriAI/litellm/blob/e19bb55e3b4c6a858b6e364302ebbf6633a51de5/litellm/caching.py#L471

It seems like the ttl configs are not being passed down correctly on the batch cache-write path?
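If that reading is right, the single-key write falls back to the cache's configured default TTL while the pipeline (batch) write only looks at the per-call kwargs, which contain no ttl. A sketch of the suspected divergence and the fix (class and method names are guesses at the shape of litellm/caching.py, not the actual code):

```python
class RedisCacheSketch:
    """Toy stand-in for the Redis cache class, showing only TTL handling."""
    def __init__(self, default_ttl=None):
        self.default_ttl = default_ttl
        self.writes = []  # (key, ttl) pairs, in place of real Redis SET calls

    def set_cache(self, key, value, **kwargs):
        # single-key path: per-call ttl, else the configured default
        ttl = kwargs.get("ttl")
        if ttl is None:
            ttl = self.default_ttl
        self.writes.append((key, ttl))

    def set_cache_pipeline(self, cache_list, **kwargs):
        # suspected bug: reading only kwargs.get("ttl") leaves ttl=None here;
        # adding the same default_ttl fallback mirrors the single-key path
        ttl = kwargs.get("ttl")
        if ttl is None:
            ttl = self.default_ttl
        for key, value in cache_list:
            self.writes.append((key, ttl))

cache = RedisCacheSketch(default_ttl=600)
cache.set_cache("k1", "v1")               # single input -> ttl 600
cache.set_cache_pipeline([("k2", "v2")])  # batch input -> ttl 600 with the fallback
print(cache.writes)
```

Without the fallback in set_cache_pipeline, the batch write would go out with ttl=None, matching the `ttl=None` lines in the logs above.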