Closed antonioloison closed 4 months ago
@antonioloison we already have this; it's on the router: https://github.com/BerriAI/litellm/pull/3751
feel free to re-open if we did not address this in the PR
Thank you @ishaan-jaff for pointing out `abatch_completion_one_model_multiple_requests`.

This function currently processes the batch of completions, but it does not seem to manage throttling in relation to rate limits. Ideally, I want it to automatically pause the LLM calls to stay under the rate limit, and handle requests and retries until all the requests are completed without encountering `RateLimitError`.
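For reference, the kind of client-side throttling I have in mind could look like this minimal sketch using plain asyncio (the `throttled_gather` helper is a hypothetical illustration, not a litellm API):

```python
import asyncio


async def throttled_gather(coros, rpm_limit=5):
    """Run coroutines while staying under an RPM cap (sketch only).

    Spaces out call starts so at most `rpm_limit` calls begin per
    60-second window.
    """
    interval = 60.0 / rpm_limit  # seconds between call starts

    async def run(coro, delay):
        await asyncio.sleep(delay)
        return await coro

    tasks = [
        asyncio.create_task(run(coro, i * interval))
        for i, coro in enumerate(coros)
    ]
    # return_exceptions=True keeps one failure from cancelling the rest
    return await asyncio.gather(*tasks, return_exceptions=True)
```

This only spaces out call starts; it does not retry failures, which a real scheduler would also need.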
@antonioloison it does: set `routing_strategy="usage-based-routing-v2"`
doc here: https://docs.litellm.ai/docs/routing#advanced---routing-strategies
Can we hop on a call to figure out what you need here @antonioloison. What's the best email to send an invite to?
Link to my cal here if that's easier https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version?month=2023-10
Sure, it would be great to have a call. I sent you an invitation.
This is the synthetic task I mentioned on the call, where I call Anthropic's Haiku model (I am on the Free Tier, so a rate limit of 5 RPM). Some of the responses come back as `RateLimitError`s. I would like the parallel calls to be handled automatically with some kind of concurrency scheduler, such as aiometer, or something like this OpenAI script:
```python
import asyncio
import os
import random

from litellm import Router

random.seed(42)


def get_random_string():
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(random.choice(letters) for _ in range(10))


model_list = [
    {
        "model_name": "claude-3-haiku-20240307",
        "litellm_params": {
            "model": "claude-3-haiku-20240307",
            "api_key": os.getenv("ANTHROPIC_API_KEY"),
            "timeout": 300,
        },
        "tpm": 250000,
        "rpm": 5,
    }
]

router = Router(
    model_list=model_list,
    routing_strategy="usage-based-routing-v2",  # 👈 KEY CHANGE
    num_retries=10,
)

responses = asyncio.run(
    router.abatch_completion_one_model_multiple_requests(
        model="claude-3-haiku-20240307",
        messages=[
            [{"role": "user", "content": f"What is {get_random_string()}?"}]
            for _ in range(10)
        ],
    )
)
print(responses)
```
Here are the responses:

```
[ModelResponse(id='chatcmpl-889f32e3-d26d-4709-80ee-26be482cb88a', choices=[Choices(finish_reason='stop', index=0, message=Message(content='I\'m sorry, but "mkhfmswjri" does not appear to be a real word or have any meaning that I\'m aware of. It seems to just be a random string of letters. If you have a specific question or context around this term, please provide more information and I\'ll try my best to assist you.', role='assistant', tool_calls=[]))], created=1717084777, model='claude-3-haiku-20240307', object='chat.completion', system_fingerprint=None, usage=Usage(prompt_tokens=17, completion_tokens=72, total_tokens=89)),
 RateLimitError("AnthropicException - {'type': 'rate_limit_error', 'message': 'Number of concurrent connections has exceeded your rate limit. Please try again later or contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase.'}"),
 ModelResponse(id='chatcmpl-c697a721-11c8-4f51-9399-87a0b2c4c159', choices=[Choices(finish_reason='stop', index=0, message=Message(content='I\'m afraid "eexkfiuwts" does not appear to be a real word. It looks like a random string of letters without any clear meaning or definition. As an AI assistant, I don\'t have the ability to define or explain made-up words. If you have a specific question about a real word, I\'d be happy to try and assist you with that.', role='assistant', tool_calls=[]))], created=1717084777, model='claude-3-haiku-20240307', object='chat.completion', system_fingerprint=None, usage=Usage(prompt_tokens=16, completion_tokens=81, total_tokens=97)),
 ModelResponse(id='chatcmpl-e369bec9-fde9-4613-9611-82566466affa', choices=[Choices(finish_reason='stop', index=0, message=Message(content='I\'m afraid I don\'t have any meaningful information about the term "bdzsbnjxgc". It appears to be a random sequence of letters that doesn\'t correspond to any common word or known abbreviation that I\'m aware of. Without any additional context, I can\'t provide a clear answer about what this might refer to.', role='assistant', tool_calls=[]))], created=1717084777, model='claude-3-haiku-20240307', object='chat.completion', system_fingerprint=None, usage=Usage(prompt_tokens=17, completion_tokens=71, total_tokens=88)),
 RateLimitError("AnthropicException - {'type': 'rate_limit_error', 'message': 'Number of concurrent connections has exceeded your rate limit. Please try again later or contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase.'}"),
 RateLimitError('Deployment over defined rpm limit=5. current usage=5'),
 RateLimitError('Deployment over defined rpm limit=5. current usage=5'),
 RateLimitError('Deployment over defined rpm limit=5. current usage=5'),
 RateLimitError('Deployment over defined rpm limit=5. current usage=5'),
 RateLimitError('Deployment over defined rpm limit=5. current usage=5')]
```
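One way to turn this partial result into a fully completed batch is to re-run the failed entries with exponential backoff. A minimal sketch of that idea (the `call_with_retries` helper is a hypothetical illustration; the broad `except Exception` is a placeholder for litellm's `RateLimitError`):

```python
import asyncio
import random


async def call_with_retries(call, max_retries=10, base_delay=2.0):
    """Retry a zero-argument coroutine function with exponential backoff.

    In practice you would catch litellm's RateLimitError specifically
    rather than Exception.
    """
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # exponential backoff with a little jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
```

Wrapping each request in such a helper (combined with throttling) is essentially what I would like `batch_completion` to do internally.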
cc @krrishdholakia
The Feature
Modify the `batch_completion` method so that it can take into account several arguments related to rate limits and request retries. This script from OpenAI is a good example of how this feature could be developed.
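The core of the OpenAI script is a capacity tracker that continuously refills request slots over each minute. A rough sketch of that idea (the `RateLimiter` class below is my own hypothetical illustration, not litellm or OpenAI code):

```python
import asyncio
import time


class RateLimiter:
    """Token-bucket style limiter: refills `rpm` request slots per minute."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self.capacity = float(rpm)  # start with a full minute of capacity
        self.last = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self):
        """Block until one request slot is available, then consume it."""
        while True:
            async with self.lock:
                now = time.monotonic()
                # refill proportionally to elapsed time, capped at rpm
                self.capacity = min(
                    self.rpm,
                    self.capacity + (now - self.last) * self.rpm / 60.0,
                )
                self.last = now
                if self.capacity >= 1:
                    self.capacity -= 1
                    return
            await asyncio.sleep(0.1)  # wait for capacity to refill
```

Each worker would `await limiter.acquire()` before issuing a request, so the batch naturally pauses when the per-minute budget is spent.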
Motivation, pitch
The `batch_completion` method is based on multi-threading, and the number of workers (100) can't be modified. Many APIs have rate limits, and it would be nice to take them into account when doing `batch_completion` to avoid `RateLimitError`s. It is also important to retry the requests if there is an error, to make `batch_completion` more robust.

Twitter / LinkedIn details
@antonio_loison