BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: using Anthropic `retry-after` header #4387

Closed: jamesbraza closed this issue 6 days ago

jamesbraza commented 3 months ago

The Feature

Anthropic has a retry-after header in their response when one hits a 429 Too Many Requests error: https://docs.anthropic.com/en/api/rate-limits#response-headers

It looks like litellm==1.40.25's Anthropic code directly uses httpx for POSTs: https://github.com/BerriAI/litellm/blob/v1.40.25/litellm/llms/anthropic.py#L182
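
For illustration, here is roughly where that header surfaces on a raw httpx call. This is a standalone sketch, not LiteLLM code; the API key and payload are placeholders:

```python
import httpx

response = httpx.post(
    "https://api.anthropic.com/v1/messages",
    headers={"x-api-key": "sk-ant-...", "anthropic-version": "2023-06-01"},
    json={
        "model": "claude-3-haiku-20240307",
        "max_tokens": 16,
        "messages": [{"role": "user", "content": "hi"}],
    },
)
if response.status_code == 429:
    # Per Anthropic's docs, this says how many seconds to wait before retrying.
    print("retry-after:", response.headers.get("retry-after"))
```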

And from what I have read, it looks like LiteLLM doesn't utilize the retry-after header anywhere in the normal completion call stack.


For reference, it seems LiteLLM's router entity does support retry-after: https://github.com/BerriAI/litellm/blob/v1.40.25/litellm/utils.py#L5457

Motivation, pitch

Can we support the retry-after header? It would let LiteLLM retry Anthropic requests in a way that is compliant with their API.


ishaan-jaff commented 3 months ago

hi @jamesbraza - would you pass this with extra_headers?

jamesbraza commented 3 months ago

Maybe I am misunderstanding your question, but I think the retry-after header is in the response from Anthropic to LiteLLM, so end users don't pass anything. Does that make sense?

ishaan-jaff commented 3 months ago

oh you just want to access the response header from anthropic?

jamesbraza commented 3 months ago

So the retry-after response header from Anthropic tells LiteLLM how long to wait before retrying an API call. Currently, LiteLLM doesn't take that header into account.

> oh you just want to access the response header from anthropic?

I am not looking to access the response header; the request is that LiteLLM's Anthropic code should use that header when calculating how long to wait before a retry. Does that make sense now?

ishaan-jaff commented 3 months ago

how do you want litellm to use retry-after in the completion calls?

jamesbraza commented 3 months ago

I am using litellm.acompletion with some messages, max_retries=3, and an Anthropic model.
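
Roughly like this (the model name here is just an example):

```python
import asyncio
import litellm

async def main():
    response = await litellm.acompletion(
        model="anthropic/claude-3-opus-20240229",  # example Anthropic model
        messages=[{"role": "user", "content": "Hello"}],
        max_retries=3,
    )
    print(response)

asyncio.run(main())
```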

It seems like what happens is:

  1. max_retries gets treated as one of the optional_params
  2. This gets passed to anthropic_chat_completions.completion
  3. Eventually a POST request takes place to Anthropic
  4. The Anthropic endpoint responds with a 429 error code and a retry-after header that tells how long clients should wait before retrying
  5. LiteLLM's Anthropic driver code doesn't know to look for the retry-after header

What I want to happen at step 5 is for LiteLLM to take the retry-after header into account and sleep for that duration before retrying; see the sketch below.
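
As a rough sketch of what I mean (post_with_retries is a hypothetical helper, and it assumes retry-after carries a delay in seconds, as Anthropic's docs describe):

```python
import asyncio
import httpx

async def post_with_retries(
    client: httpx.AsyncClient, url: str, max_retries: int = 3, **kwargs
) -> httpx.Response:
    """Hypothetical sketch: retry a POST, sleeping for retry-after on 429s."""
    for attempt in range(max_retries + 1):
        response = await client.post(url, **kwargs)
        if response.status_code != 429 or attempt == max_retries:
            return response
        retry_after = response.headers.get("retry-after")
        # Honor the server's hint when present; otherwise fall back to backoff.
        delay = float(retry_after) if retry_after is not None else 2 ** attempt
        await asyncio.sleep(delay)
    return response
```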

For reference, it looks like LiteLLM's azure_dall_e_2 is actually using the retry-after header: https://github.com/BerriAI/litellm/blob/v1.41.3/litellm/llms/custom_httpx/azure_dall_e_2.py#L53. Though I am not sure a custom HTTP transport layer is necessary to respect retry-after; it seems like overkill.
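
For comparison, a transport-level version would look roughly like this (RetryAfterTransport is a hypothetical name, not the actual azure_dall_e_2 code):

```python
import asyncio
import httpx

class RetryAfterTransport(httpx.AsyncBaseTransport):
    """Hypothetical sketch of the custom-transport approach: sleep and re-send on 429."""

    def __init__(self, inner: httpx.AsyncBaseTransport, max_retries: int = 3):
        self.inner = inner
        self.max_retries = max_retries

    async def handle_async_request(self, request: httpx.Request) -> httpx.Response:
        for attempt in range(self.max_retries + 1):
            response = await self.inner.handle_async_request(request)
            retry_after = response.headers.get("retry-after")
            if response.status_code != 429 or retry_after is None or attempt == self.max_retries:
                return response
            await response.aread()  # drain the old response before re-sending
            await asyncio.sleep(float(retry_after))
        return response

# Usage: httpx.AsyncClient(transport=RetryAfterTransport(httpx.AsyncHTTPTransport()))
```

A plain retry loop around the POST, like the earlier sketch, seems simpler and achieves the same thing.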