BerriAI / litellm

Python SDK, Proxy Server to call 100+ LLM APIs using the OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: Support incoming `/v1/batches*` requests #3246

Closed. Manouchehri closed this 2 months ago.

Manouchehri commented 4 months ago

The Feature

https://platform.openai.com/docs/api-reference/batch/create
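For reference, the linked endpoint takes a JSONL file where each line is one request in the OpenAI format. A rough sketch of preparing such a file (model name, prompts, and file name are just placeholders):

import json

# Each line of the batch input file is one chat completion request.
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4-turbo-2024-04-09",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["What is 2+2?", "Summarize the Iliad in one line."])
]

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

Here's the proxy config I have in mind: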

litellm_settings:
  cache: True
  s3_batching_params: # required if using batching; no bucket, no batching
    s3_bucket_name: os.environ/BATCHING_S3_BUCKET_NAME
    s3_region_name: os.environ/BATCHING_AWS_DEFAULT_REGION
    s3_aws_access_key_id: os.environ/BATCHING_AWS_ACCESS_KEY_ID
    s3_aws_secret_access_key: os.environ/BATCHING_AWS_SECRET_ACCESS_KEY
    s3_endpoint_url: os.environ/BATCHING_AWS_ENDPOINT_URL_S3

router_settings:
  batching_strategy: usage-based-routing
  batching_strategy_args: {"ttl": 60, "reserved_capacity": 0.1} # leave 10% of capacity for non-batched urgent requests

  # other batching strategies:
  # batching_strategy: simple-duration # sends one request every completion_window/batch_size seconds
  # e.g. for a batch size of 10 and a completion_window of 60 seconds, it sends a request every 6 seconds

environment_variables:

model_list:
  - model_name: gpt-4-turbo-2024-04-09
    litellm_params:
      model: openai/gpt-4-turbo-2024-04-09
      api_key: os.environ/OPENAI_API_KEY
      rpm: 50
      batching_params:
        native_batching: True # default is False
        completion_window: 86400
        enforce_completion_window: True # default is True if native_batching is True
        # enforce_completion_window: False # This will cause the router to send non-batched requests if the user's completion window is less than the model's completion window

  # for models that don't support native batching, LiteLLM should just send the requests as normal
  - model_name: gemini-1.5-pro-preview-0409
    litellm_params:
      model: vertex_ai/gemini-1.5-pro-preview-0409
      rpm: 10
      batching_params:
        native_batching: False # default is False; allow batching to be used even though there are no cost savings
        # if completion_window is undefined, then allow any completion window between the min and max
        completion_window_min: 60
        completion_window_max: 3600
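
A rough sketch of how a client might hit the proxy's /v1/batches endpoints once something like the above config is in place (base_url, api_key, and file names are placeholders; the calls are the standard OpenAI SDK Batch API):

from openai import OpenAI

# Point the OpenAI SDK at the LiteLLM proxy (placeholder address and key).
client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

# Upload the JSONL request file prepared earlier.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")

# Create the batch; the proxy would route it according to the batching_params above.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)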

Motivation, pitch

For non-real-time tasks, the new OpenAI batching endpoints look super helpful. The main reason I see folks using them is cost savings.

Even for models that don't support cost savings through batching, it would be super handy for LiteLLM to load-balance a large number of prompts across a requested time period.

e.g. say I want to run Gemini 1.5 Pro via Google AI Studio on 1,000 prompts that I have prepared in a Lambda data-ingestion script. Without batching, I would have to keep that Lambda script running for over 3 hours while waiting for LiteLLM to return all 1,000 responses. With batching, the ingestion script could send all the prompts in a few seconds and quit; a separate script on a scheduler could then check every hour whether the batched request has completed. With batching, I no longer need long-running scripts. =)
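A rough sketch of that scheduled check (base_url, key, and the stored batch id are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")  # placeholders

batch_id = "batch_abc123"  # placeholder: id saved by the ingestion script at submit time
batch = client.batches.retrieve(batch_id)
if batch.status == "completed":
    # Download the results once the batch is done.
    output = client.files.content(batch.output_file_id)
    with open("batch_output.jsonl", "wb") as f:
        f.write(output.content)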

Twitter / LinkedIn details

https://www.linkedin.com/in/davidmanouchehri/

krrishdholakia commented 2 months ago

Linking Ishaan's PR that added support for batches: https://github.com/BerriAI/litellm/pull/3885

ishaan-jaff commented 2 months ago

We support batches: https://docs.litellm.ai/docs/batches