joerunde opened 1 year ago
Hi @joerunde, I'm the maintainer of LiteLLM. We implemented a request queue for making LLM API calls (to any LLM).
Here's a quick start on this:
docs: https://docs.litellm.ai/docs/routing#queuing-beta
```bash
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
```
```bash
$ litellm --config /path/to/config.yaml --use_queue
```
Here's an example config.yaml for gpt-3.5-turbo (this will load balance between OpenAI + Azure endpoints):
```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2 # actual model name
      api_key:
      api_version: 2023-07-01-preview
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
```
```bash
$ litellm --test_async --num_requests 100
```
The queue exposes two endpoints:
- /queue/request - Queues a /chat/completions request. Returns a job id.
- /queue/response/{id} - Returns the status of a job. If completed, returns the response as well. Potential statuses are: queued and finished.
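For reference, a minimal polling client against those two endpoints might look like the sketch below. Only the endpoint paths and the queued/finished statuses come from the description above; the base URL, request payload shape, and response field names are assumptions, so check the LiteLLM routing/queuing docs for the authoritative format.

```python
import time

import requests

BASE_URL = "http://0.0.0.0:8000"  # proxy address is an assumption

# Queue a /chat/completions request; the response should contain a job id.
resp = requests.post(
    f"{BASE_URL}/queue/request",
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]  # field name assumed

# Poll /queue/response/{id} until the job moves from "queued" to "finished".
while True:
    status = requests.get(f"{BASE_URL}/queue/response/{job_id}").json()
    if status.get("status") == "finished":
        print(status)  # finished jobs should include the completion response
        break
    time.sleep(1)
```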
Is your feature request related to a problem? Please describe.
We had previously assumed that all usages of the caikit runtime would be deployed with https://github.com/kserve/modelmesh, which would handle all performance and scaling concerns. This is becoming less true as the caikit runtime is used for serverless deployments and for running LLMs, which model-mesh is not exactly designed to handle.
As a result of this assumption, the caikit runtime has unbounded work queues in the grpc server. This is not ideal when you're trying to run a high load of requests against a caikit runtime cluster and don't want your server queues to have lots of requests sitting in them while you're trying to scale up.
Describe the solution you'd like
It should be a relatively simple change to add a configuration option to set a maximum queue size for the server and return a RESOURCE_EXHAUSTED error when the queue is full. A rough sketch of what this could look like is below.
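Assuming the runtime builds its server with the standard grpc.server + ThreadPoolExecutor, one option is gRPC's own maximum_concurrent_rpcs argument, which rejects additional calls with RESOURCE_EXHAUSTED once the limit is hit. The numbers below are illustrative only; the limit would presumably come from the new config option.

```python
from concurrent import futures

import grpc

# max_workers bounds how many requests execute at once;
# maximum_concurrent_rpcs bounds how many are accepted at all
# (executing + waiting). Anything beyond that limit is rejected
# with a RESOURCE_EXHAUSTED status.
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=8),
    maximum_concurrent_rpcs=64,  # illustrative; would come from config
)
```

This bounds executing plus queued RPCs together rather than the queue depth alone, but it gives the back-pressure behavior described above without a custom interceptor.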
Describe alternatives you've considered
Using wait_for_ready in stubs (and checking that this all works with the way our runtime server is set up); a client-side example of the flag is sketched below.
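For illustration, this is roughly how a client opts into that behavior. The gRPC health-checking stub is used purely as a stand-in for a generated caikit stub, and the address and timeout are assumptions.

```python
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

# Any generated stub works the same way; the health stub is just a stand-in.
channel = grpc.insecure_channel("localhost:8085")  # address is an assumption
stub = health_pb2_grpc.HealthStub(channel)

# wait_for_ready makes gRPC queue the call on the client side until the
# channel is READY (or the deadline expires) instead of failing fast.
response = stub.Check(
    health_pb2.HealthCheckRequest(service=""),
    wait_for_ready=True,
    timeout=30.0,
)
```

Note that this only controls client behavior while the channel is not ready; it does not by itself bound the server-side work queue.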
Additional context
We currently use a basic ThreadPoolExecutor to handle service requests; presumably its internal queue holds all of the yet-to-be-processed requests, but I'm not 100% sure about that. I don't know whether you can set limits on that queue, or specify the behavior to use when it is full. There might be another standard thread pool that does have these features; it would be great to avoid writing our own if possible. (I haven't dug too deeply into this.) A rough sketch of bounding the executor's queue is below.
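For what it's worth, a minimal sketch of bounding the stdlib executor by overriding submit(). It pokes at the undocumented _work_queue attribute, so treat it as an assumption that this stays stable across Python versions, and the gRPC layer would still need to translate the rejection into a RESOURCE_EXHAUSTED status (which is part of why the server-level limit sketched above may be the easier path).

```python
from concurrent.futures import ThreadPoolExecutor


class BoundedThreadPoolExecutor(ThreadPoolExecutor):
    """ThreadPoolExecutor that refuses new work once its queue is too deep.

    Relies on the undocumented _work_queue attribute of the stdlib executor,
    so this is a sketch rather than a guaranteed-stable solution.
    """

    def __init__(self, *args, max_queue_size: int = 100, **kwargs):
        super().__init__(*args, **kwargs)
        self._max_queue_size = max_queue_size

    def submit(self, fn, /, *args, **kwargs):
        # qsize() is approximate, but close enough for back-pressure purposes.
        if self._work_queue.qsize() >= self._max_queue_size:
            raise RuntimeError("thread pool work queue is full")
        return super().submit(fn, *args, **kwargs)
```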