StampyAI / stampy-chat

Conversational chatbot to answer questions about AI Safety & Alignment based on information retrieved from the Alignment Research Dataset
https://chat.stampy.ai
MIT License

Investigate usage limits #55

Open mruwnik opened 11 months ago

mruwnik commented 11 months ago

Find out how many requests per second the current system can handle. This applies both to the server infrastructure and to the underlying LLM backend.
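
A rough way to get a baseline for the web tier, as a sketch (the target URL and request count are placeholders, and this does not exercise the LLM backend): fire a batch of concurrent requests and measure throughput.

import asyncio
import time

import aiohttp

URL = "https://chat.stampy.ai/"  # placeholder target, not the actual chat API route
N = 50  # number of concurrent requests

async def fetch(session):
    # Issue one request and return its HTTP status.
    async with session.get(URL) as resp:
        await resp.read()
        return resp.status

async def main():
    start = time.monotonic()
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(*(fetch(session) for _ in range(N)))
    elapsed = time.monotonic() - start
    ok = sum(s == 200 for s in statuses)
    print(f"{ok}/{N} ok in {elapsed:.1f}s -> {N / elapsed:.1f} req/s")

asyncio.run(main())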

ccstan99 commented 11 months ago

OpenAI LLM chat rate limits. Current defaults as of July 2023: 3,500 RPM (requests per minute) and 90,000 TPM (tokens per minute).

henri123lemoine commented 11 months ago

We can apply to increase our rate limits. Should we? It's also important to note that, should we decide to use it, GPT-4's current rate limit is 200 RPM. OpenAI will increase that number over time.

henri123lemoine commented 11 months ago

Current GPT-4 rate limits are 200 requests or 40,000 tokens per minute. Since the base GPT-4 model has an 8K-token context window, we would likely hit the TPM cap within 5-6 questions if we use the full context window for every chat (5 x ~8,000 tokens = ~40,000). That might be a problem. We need to set up a system that handles GPT-4 rate limits by falling back to ChatGPT when they are hit. Additionally, we could limit spamming by slowing down the streaming when a single user sends queries at an implausible speed.
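
A minimal sketch of both ideas, assuming the pre-1.0 openai Python client (the helper names allow_request and complete_with_fallback are made up for illustration): try GPT-4 first and fall back to ChatGPT on a rate-limit error, and refuse queries from users who exceed a simple per-minute budget.

import time

import openai

# Hypothetical per-user throttle: allow at most MAX_PER_MINUTE queries per user.
MAX_PER_MINUTE = 5
_recent_requests: dict[str, list[float]] = {}

def allow_request(user_id: str) -> bool:
    # Keep only timestamps from the last 60 seconds, then check the budget.
    now = time.time()
    recent = [t for t in _recent_requests.get(user_id, []) if now - t < 60]
    _recent_requests[user_id] = recent + [now]
    return len(recent) < MAX_PER_MINUTE

def complete_with_fallback(messages: list[dict]):
    # Try GPT-4 first; degrade to ChatGPT when GPT-4's RPM/TPM budget is exhausted.
    try:
        return openai.ChatCompletion.create(model="gpt-4", messages=messages)
    except openai.error.RateLimitError:
        return openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)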

ishaan-jaff commented 7 months ago

@mruwnik @ccstan99 @henri123lemoine

I'm the maintainer of LiteLLM. It lets you maximize throughput and raise your effective rate limits by load balancing across multiple deployments (Azure, OpenAI). I believe LiteLLM can be helpful here, and I'd love your feedback if we're missing something.

Here's how to use it (docs: https://docs.litellm.ai/docs/routing):

import os

from litellm import Router

model_list = [{ # list of model deployments 
    "model_name": "gpt-3.5-turbo", # model alias 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-functioncalling", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(model="gpt-3.5-turbo", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}])

print(response)
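
Since all three entries share the model alias "gpt-3.5-turbo", the Router spreads requests across them; per the linked docs it can also retry a request on another deployment and cool down deployments that return rate-limit errors.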