mruwnik opened 1 year ago
OpenAI LLM chat rate limits

Current defaults as of July 2023: 3,500 RPM (requests per minute) and 90,000 TPM (tokens per minute).
We can apply to increase our rate limits. Should we? It's also important to note that, should we decide to use it, GPT-4's current rate limit is 200 RPM. OpenAI will increase that number over time.
GPT-4's current rate limits are 200 messages or 40,000 tokens per minute, which we would likely hit within 5-6 questions if we use the full context window for every chat. That could be a problem. We need a system that handles GPT-4 rate limits by switching to ChatGPT when they are hit. Additionally, we could limit spamming by slowing down streaming when a single user sends queries at an implausible speed.
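A minimal sketch of what that fallback could look like, assuming the pre-1.0 openai Python SDK (openai.ChatCompletion.create and openai.error.RateLimitError); the model names and single-retry behaviour are illustrative, not a decision about how we'd actually configure it:

import openai  # reads OPENAI_API_KEY from the environment

def chat_with_fallback(messages, primary="gpt-4", fallback="gpt-3.5-turbo"):
    """Try GPT-4 first; fall back to ChatGPT if we hit the GPT-4 rate limit."""
    try:
        return openai.ChatCompletion.create(model=primary, messages=messages)
    except openai.error.RateLimitError:
        # GPT-4 quota exhausted for this minute -- degrade to the cheaper model
        return openai.ChatCompletion.create(model=fallback, messages=messages)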
@mruwnik @ccstan99 @henri123lemoine
Hi, I'm the maintainer of LiteLLM. We let you maximize your throughput / increase your rate limits by load balancing between multiple deployments (Azure, OpenAI). I believe LiteLLM can be helpful here, and I'd love your feedback if we're missing something.
Here's how to use it. Docs: https://docs.litellm.ai/docs/routing
import os
from litellm import Router

model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ",
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(model="gpt-3.5-turbo",
                             messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
Find out how many requests per second can be handled by the current system. This applies both to the server infrastructure and to the underlying LLM setup.
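One rough way to measure that is to fire a batch of concurrent requests at the chat endpoint and time them. The sketch below assumes aiohttp is installed; the URL, payload shape, and concurrency/total numbers are placeholders, not the project's actual API:

import asyncio
import time
import aiohttp

URL = "http://localhost:8000/chat"  # hypothetical chat endpoint
CONCURRENCY = 20                    # simultaneous in-flight requests
TOTAL = 200                         # total requests to send

async def one_request(session, payload):
    async with session.post(URL, json=payload) as resp:
        await resp.read()
        return resp.status

async def main():
    payload = {"query": "What is AI alignment?"}  # placeholder request body
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(session):
        async with sem:
            return await one_request(session, payload)

    start = time.monotonic()
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*[bounded(session) for _ in range(TOTAL)])
    elapsed = time.monotonic() - start
    ok = sum(1 for s in statuses if s == 200)
    print(f"{ok}/{TOTAL} succeeded in {elapsed:.1f}s -> {TOTAL / elapsed:.1f} req/s")

asyncio.run(main())

Running this at a few different concurrency levels would give a rough requests-per-second ceiling for the server; the LLM side is bounded separately by the RPM/TPM limits above.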