mruwnik opened 1 year ago
OpenAI LLM chat rate limits

Current defaults as of July 2023: 3,500 RPM (requests per minute) and 90,000 TPM (tokens per minute).
We can apply to increase our rate limits. Should we? It's also important to note that, should we decide to use it, GPT-4's current rate limit is 200 RPM. OpenAI will increase that number over time.
GPT-4's current rate limits are 200 messages or 40,000 tokens per minute, which we would likely hit within 5-6 questions if we use the full context window for every chat. That could be a problem. We need a system that handles GPT-4 rate limits by switching to ChatGPT when they are hit. Additionally, we could limit spamming by slowing down streaming when a single user sends queries at an implausible speed.
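A minimal sketch of what that fallback could look like, assuming the pre-1.0 openai Python SDK (openai.ChatCompletion.create and openai.error.RateLimitError); the model names and single-retry behaviour are illustrative, not a decision about how we'd actually configure it:

import openai  # reads OPENAI_API_KEY from the environment

def chat_with_fallback(messages, primary="gpt-4", fallback="gpt-3.5-turbo"):
    """Try GPT-4 first; fall back to ChatGPT if we hit the GPT-4 rate limit."""
    try:
        return openai.ChatCompletion.create(model=primary, messages=messages)
    except openai.error.RateLimitError:
        # GPT-4 quota exhausted for this minute -- degrade to the cheaper model
        return openai.ChatCompletion.create(model=fallback, messages=messages)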
@mruwnik @ccstan99 @henri123lemoine
Hi, I'm the maintainer of LiteLLM. We let you maximize your throughput / increase your rate limits by load balancing between multiple deployments (Azure, OpenAI). I believe LiteLLM can be helpful here, and I'd love your feedback if we're missing something.
Here's how to use it. Docs: https://docs.litellm.ai/docs/routing
import os
from litellm import Router

model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ",
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(model="gpt-3.5-turbo",
                             messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
Find out how many requests per second can be handled by the current system. This applies both to the server infrastructure and to the underlying LLM setup.
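One rough way to measure that is to fire a batch of concurrent requests at the chat endpoint and time them. The sketch below assumes aiohttp is installed; the URL, payload shape, and concurrency/total numbers are placeholders, not the project's actual API:

import asyncio
import time
import aiohttp

URL = "http://localhost:8000/chat"  # hypothetical chat endpoint
CONCURRENCY = 20                    # simultaneous in-flight requests
TOTAL = 200                         # total requests to send

async def one_request(session, payload):
    async with session.post(URL, json=payload) as resp:
        await resp.read()
        return resp.status

async def main():
    payload = {"query": "What is AI alignment?"}  # placeholder request body
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(session):
        async with sem:
            return await one_request(session, payload)

    start = time.monotonic()
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*[bounded(session) for _ in range(TOTAL)])
    elapsed = time.monotonic() - start
    ok = sum(1 for s in statuses if s == 200)
    print(f"{ok}/{TOTAL} succeeded in {elapsed:.1f}s -> {TOTAL / elapsed:.1f} req/s")

asyncio.run(main())

Running this at a few different concurrency levels would give a rough requests-per-second ceiling for the server; the LLM side is bounded separately by the RPM/TPM limits above.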