Closed antonioloison closed 4 months ago
@antonioloison we already have this; it's on the router: https://github.com/BerriAI/litellm/pull/3751
feel free to re-open if we did not address this in the PR
Thank you @ishaan-jaff for pointing out `abatch_completion_one_model_multiple_requests`.

This function currently processes the batch of completions, but it does not seem to manage throttling in relation to rate limits. Ideally, I want it to automatically pause the LLM calls to stay under the rate limit, and handle requests and retries until all the requests are completed without encountering `RateLimitError`.
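For reference, the kind of client-side throttling I have in mind could look like this minimal sketch using plain asyncio (the `throttled_gather` helper is a hypothetical illustration, not a litellm API):

```python
import asyncio


async def throttled_gather(coros, rpm_limit=5):
    """Run coroutines while staying under an RPM cap (sketch only).

    Spaces out call starts so at most `rpm_limit` calls begin per
    60-second window.
    """
    interval = 60.0 / rpm_limit  # seconds between call starts

    async def run(coro, delay):
        await asyncio.sleep(delay)
        return await coro

    tasks = [
        asyncio.create_task(run(coro, i * interval))
        for i, coro in enumerate(coros)
    ]
    # return_exceptions=True keeps one failure from cancelling the rest
    return await asyncio.gather(*tasks, return_exceptions=True)
```

This only spaces out call starts; it does not retry failures, which a real scheduler would also need.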
@antonioloison it does: set `routing_strategy="usage-based-routing-v2"`
doc here: https://docs.litellm.ai/docs/routing#advanced---routing-strategies
Can we hop on a call to figure out what you need here @antonioloison. What's the best email to send an invite to?
Link to my cal here if that's easier https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version?month=2023-10
Sure, it would be great to have a call. I sent you an invitation.
This is the synthetic task I mentioned on the call, where I call Anthropic's Haiku model (I am on the Free Tier, so a rate limit of 5 RPM). Some of the responses come back as `RateLimitError`s. I would like the parallel calls to be handled automatically with some kind of concurrency scheduler, such as aiometer, or something like this OpenAI script:
```python
import asyncio
import os
import random

from litellm import Router

random.seed(42)


def get_random_string():
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(random.choice(letters) for _ in range(10))


model_list = [
    {
        "model_name": "claude-3-haiku-20240307",
        "litellm_params": {
            "model": "claude-3-haiku-20240307",
            "api_key": os.getenv("ANTHROPIC_API_KEY"),
            "timeout": 300,
        },
        "tpm": 250000,
        "rpm": 5,
    }
]

router = Router(
    model_list=model_list,
    routing_strategy="usage-based-routing-v2",  # 👈 KEY CHANGE
    num_retries=10,
)

responses = asyncio.run(
    router.abatch_completion_one_model_multiple_requests(
        model="claude-3-haiku-20240307",
        messages=[
            [{"role": "user", "content": f"What is {get_random_string()}?"}]
            for _ in range(10)
        ],
    )
)
print(responses)
```
Here are the responses:

```
[ModelResponse(id='chatcmpl-889f32e3-d26d-4709-80ee-26be482cb88a', choices=[Choices(finish_reason='stop', index=0, message=Message(content='I\'m sorry, but "mkhfmswjri" does not appear to be a real word or have any meaning that I\'m aware of. It seems to just be a random string of letters. If you have a specific question or context around this term, please provide more information and I\'ll try my best to assist you.', role='assistant', tool_calls=[]))], created=1717084777, model='claude-3-haiku-20240307', object='chat.completion', system_fingerprint=None, usage=Usage(prompt_tokens=17, completion_tokens=72, total_tokens=89)),
 RateLimitError("AnthropicException - {'type': 'rate_limit_error', 'message': 'Number of concurrent connections has exceeded your rate limit. Please try again later or contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase.'}"),
 ModelResponse(id='chatcmpl-c697a721-11c8-4f51-9399-87a0b2c4c159', choices=[Choices(finish_reason='stop', index=0, message=Message(content='I\'m afraid "eexkfiuwts" does not appear to be a real word. It looks like a random string of letters without any clear meaning or definition. As an AI assistant, I don\'t have the ability to define or explain made-up words. If you have a specific question about a real word, I\'d be happy to try and assist you with that.', role='assistant', tool_calls=[]))], created=1717084777, model='claude-3-haiku-20240307', object='chat.completion', system_fingerprint=None, usage=Usage(prompt_tokens=16, completion_tokens=81, total_tokens=97)),
 ModelResponse(id='chatcmpl-e369bec9-fde9-4613-9611-82566466affa', choices=[Choices(finish_reason='stop', index=0, message=Message(content='I\'m afraid I don\'t have any meaningful information about the term "bdzsbnjxgc". It appears to be a random sequence of letters that doesn\'t correspond to any common word or known abbreviation that I\'m aware of. Without any additional context, I can\'t provide a clear answer about what this might refer to.', role='assistant', tool_calls=[]))], created=1717084777, model='claude-3-haiku-20240307', object='chat.completion', system_fingerprint=None, usage=Usage(prompt_tokens=17, completion_tokens=71, total_tokens=88)),
 RateLimitError("AnthropicException - {'type': 'rate_limit_error', 'message': 'Number of concurrent connections has exceeded your rate limit. Please try again later or contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase.'}"),
 RateLimitError('Deployment over defined rpm limit=5. current usage=5'),
 RateLimitError('Deployment over defined rpm limit=5. current usage=5'),
 RateLimitError('Deployment over defined rpm limit=5. current usage=5'),
 RateLimitError('Deployment over defined rpm limit=5. current usage=5'),
 RateLimitError('Deployment over defined rpm limit=5. current usage=5')]
```
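One way to turn this partial result into a fully completed batch is to re-run the failed entries with exponential backoff. A minimal sketch of that idea (the `call_with_retries` helper is a hypothetical illustration; the broad `except Exception` is a placeholder for litellm's `RateLimitError`):

```python
import asyncio
import random


async def call_with_retries(call, max_retries=10, base_delay=2.0):
    """Retry a zero-argument coroutine function with exponential backoff.

    In practice you would catch litellm's RateLimitError specifically
    rather than Exception.
    """
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # exponential backoff with a little jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
```

Wrapping each request in such a helper (combined with throttling) is essentially what I would like `batch_completion` to do internally.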
cc @krrishdholakia
The Feature
Modify the `batch_completion` method so that it can take into account several arguments related to rate limits and request retries. This script from OpenAI is a good example of how this feature could be developed.
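The core of the OpenAI script is a capacity tracker that continuously refills request slots over each minute. A rough sketch of that idea (the `RateLimiter` class below is my own hypothetical illustration, not litellm or OpenAI code):

```python
import asyncio
import time


class RateLimiter:
    """Token-bucket style limiter: refills `rpm` request slots per minute."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self.capacity = float(rpm)  # start with a full minute of capacity
        self.last = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self):
        """Block until one request slot is available, then consume it."""
        while True:
            async with self.lock:
                now = time.monotonic()
                # refill proportionally to elapsed time, capped at rpm
                self.capacity = min(
                    self.rpm,
                    self.capacity + (now - self.last) * self.rpm / 60.0,
                )
                self.last = now
                if self.capacity >= 1:
                    self.capacity -= 1
                    return
            await asyncio.sleep(0.1)  # wait for capacity to refill
```

Each worker would `await limiter.acquire()` before issuing a request, so the batch naturally pauses when the per-minute budget is spent.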
Motivation, pitch
The `batch_completion` method is based on multi-threading, and the number of workers (100) can't be modified. Many APIs have rate limits, and it would be nice to take them into account when doing `batch_completion` to avoid `RateLimitError`s. It is also important to retry the requests if there is an error, to make `batch_completion` more robust.

Twitter / LinkedIn details
@antonio_loison