LLProxy

LLProxy was designed to effectively manage rate limits and schedule workload across multiple different LLM-based applications. The rate limits for these services are complex, beyond what can easily be configured with the simplest of reverse proxies. LLProxy addresses this by providing a scheduler that deeply understands the core LLM providers' rate-limiting behavior.
Supported providers: openai
Scheduling: FIFO
Set up your configuration file:
cp config-example.json config.json
Each provider can be defined as a specific route.
config.json
{
"routes": {
"openai": {
"forward": "https://api.openai.com",
"provider": "openai",
"models": {
"gpt-4": {
"maxQueueSize": 10,
"maxQueueWait": 30,
"rpm": 200,
"tpm": 40000
},
...
}
}
...
}
}
The above creates a route at http://proxyhost:8080/openai/... and forwards all traffic sent to that route to https://api.openai.com/...
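For illustration only, here is a minimal sketch of this kind of prefix-stripping forwarding using Go's standard library. This is not LLProxy's actual code; the /openai prefix, port 8080, and upstream URL simply mirror the route above.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Upstream target, taken from the "forward" field of the route above.
	target, err := url.Parse("https://api.openai.com")
	if err != nil {
		log.Fatal(err)
	}

	// Rewrite incoming requests so they are sent on to the target host.
	proxy := httputil.NewSingleHostReverseProxy(target)
	director := proxy.Director
	proxy.Director = func(req *http.Request) {
		director(req)
		req.Host = target.Host // make sure the upstream sees its own host name
	}

	// Strip the "/openai" route prefix so that, e.g.,
	// /openai/v1/chat/completions is forwarded as /v1/chat/completions.
	http.Handle("/openai/", http.StripPrefix("/openai", proxy))

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```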
It further defines a scheduler for the gpt-4 model that sets:
- maxQueueSize: how many requests are allowed to sit in the queue prior to being scheduled
- maxQueueWait: how long, in seconds, a request is allowed to wait before the proxy starts rejecting additional requests with RateLimit errors
- rpm: the maximum requests per minute
- tpm: the maximum tokens per minute

Requests and tokens per minute are consumed as requests come in and recover over time. If a request cannot be immediately processed, it will sit in the queue for up to maxQueueWait seconds, and up to maxQueueSize items can be outstanding in the queue.
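To make these semantics concrete, below is a rough, illustrative sketch (not LLProxy's actual scheduler) of a per-model limiter that spends rpm/tpm budget as requests arrive, recovers it over time, bounds the number of queued requests at maxQueueSize, and returns a RateLimit-style error once a request has waited longer than maxQueueWait. The type and function names are invented for this example, and the real queue ordering and rejection behavior may differ from this simplification.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"sync"
	"time"
)

// ErrRateLimit stands in for the RateLimit error described above.
var ErrRateLimit = errors.New("rate limited: queue full or maxQueueWait exceeded")

// modelLimiter is an invented, simplified stand-in for the per-model scheduler.
type modelLimiter struct {
	mu         sync.Mutex
	rpm, tpm   float64 // per-minute budgets from the config
	reqBudget  float64 // remaining request budget
	tokBudget  float64 // remaining token budget
	lastRefill time.Time

	queue        chan struct{} // bounded queue: at most maxQueueSize waiters
	maxQueueWait time.Duration
}

func newModelLimiter(rpm, tpm float64, maxQueueSize int, maxQueueWait time.Duration) *modelLimiter {
	return &modelLimiter{
		rpm: rpm, tpm: tpm,
		reqBudget:    rpm,
		tokBudget:    tpm,
		lastRefill:   time.Now(),
		queue:        make(chan struct{}, maxQueueSize),
		maxQueueWait: maxQueueWait,
	}
}

// refill recovers budget in proportion to elapsed time, capped at one minute's
// worth -- the "consumed as requests come in and recover over time" behavior.
func (m *modelLimiter) refill() {
	now := time.Now()
	elapsed := now.Sub(m.lastRefill).Minutes()
	m.lastRefill = now
	m.reqBudget = math.Min(m.rpm, m.reqBudget+elapsed*m.rpm)
	m.tokBudget = math.Min(m.tpm, m.tokBudget+elapsed*m.tpm)
}

// tryConsume spends budget for one request costing `tokens` tokens, if possible.
func (m *modelLimiter) tryConsume(tokens float64) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.refill()
	if m.reqBudget < 1 || m.tokBudget < tokens {
		return false
	}
	m.reqBudget--
	m.tokBudget -= tokens
	return true
}

// Acquire admits a request once budget is available, or fails with ErrRateLimit
// when the queue already holds maxQueueSize requests or the wait exceeds maxQueueWait.
func (m *modelLimiter) Acquire(tokens float64) error {
	select {
	case m.queue <- struct{}{}: // take a queue slot
	default:
		return ErrRateLimit // queue is full
	}
	defer func() { <-m.queue }() // release the slot when done waiting

	deadline := time.Now().Add(m.maxQueueWait)
	for {
		if m.tryConsume(tokens) {
			return nil
		}
		if time.Now().After(deadline) {
			return ErrRateLimit
		}
		time.Sleep(50 * time.Millisecond) // poll; a real scheduler can wake waiters precisely
	}
}

func main() {
	// Values mirror the gpt-4 entry in the example config above.
	limiter := newModelLimiter(200, 40000, 10, 30*time.Second)
	if err := limiter.Acquire(1200); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("admitted: forward the request upstream")
}
```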
Set a config for every model you want to support.
[Optional] Run tests
./test.sh
[Optional] Look at code coverage
go tool cover -html=coverage.out -o coverage.html
Build the application
./build.sh
Run the application
./llproxy
Direct traffic to your proxy server
import openai
# Point the OpenAI client at the proxy's openai route instead of the provider directly.
openai.api_base = 'http://<your-proxy-address>:8080/openai/v1'
...