joerunde opened 1 year ago
Hi @joerunde, I'm the maintainer of LiteLLM. We implemented a request queue for making LLM API calls (to any LLM).
Here's a quick start on this:
docs: https://docs.litellm.ai/docs/routing#queuing-beta
```bash
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
```
```bash
$ litellm --config /path/to/config.yaml --use_queue
```
Here's an example config.yaml for gpt-3.5-turbo (this will load balance between OpenAI + Azure endpoints):
```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2 # actual model name
      api_key:
      api_version: 2023-07-01-preview
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
```
```bash
$ litellm --test_async --num_requests 100
```
The queue exposes two endpoints:
- /queue/request - Queues a /chat/completions request. Returns a job id.
- /queue/response/{id} - Returns the status of a job. If completed, returns the response as well. Potential statuses are: queued and finished.
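For reference, a minimal polling client against those two endpoints might look like the sketch below. Only the endpoint paths and the queued/finished statuses come from the description above; the base URL, request payload shape, and response field names are assumptions, so check the LiteLLM routing/queuing docs for the authoritative format.

```python
import time

import requests

BASE_URL = "http://0.0.0.0:8000"  # proxy address is an assumption

# Queue a /chat/completions request; the response should contain a job id.
resp = requests.post(
    f"{BASE_URL}/queue/request",
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]  # field name assumed

# Poll /queue/response/{id} until the job moves from "queued" to "finished".
while True:
    status = requests.get(f"{BASE_URL}/queue/response/{job_id}").json()
    if status.get("status") == "finished":
        print(status)  # finished jobs should include the completion response
        break
    time.sleep(1)
```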
Is your feature request related to a problem? Please describe.
We had previously assumed that all usages of the caikit runtime would be deployed with https://github.com/kserve/modelmesh, which would handle all performance and scaling concerns. This is becoming less true as the caikit runtime is used for serverless deployments and for running LLMs, which model-mesh is not exactly designed to handle.
As a result of this assumption, the caikit runtime has unbounded work queues in the grpc server. This is not ideal when you're trying to run a high load of requests against a caikit runtime cluster and don't want your server queues to have lots of requests sitting in them while you're trying to scale up.
Describe the solution you'd like
It should be a relatively simple change to add a configuration option to set a maximum queue size for the server and return a RESOURCE_EXHAUSTED error when the queue is full. A rough sketch of what this could look like is below.
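Assuming the runtime builds its server with the standard grpc.server + ThreadPoolExecutor, one option is gRPC's own maximum_concurrent_rpcs argument, which rejects additional calls with RESOURCE_EXHAUSTED once the limit is hit. The numbers below are illustrative only; the limit would presumably come from the new config option.

```python
from concurrent import futures

import grpc

# max_workers bounds how many requests execute at once;
# maximum_concurrent_rpcs bounds how many are accepted at all
# (executing + waiting). Anything beyond that limit is rejected
# with a RESOURCE_EXHAUSTED status.
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=8),
    maximum_concurrent_rpcs=64,  # illustrative; would come from config
)
```

This bounds executing plus queued RPCs together rather than the queue depth alone, but it gives the back-pressure behavior described above without a custom interceptor.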
Describe alternatives you've considered
Using wait_for_ready in stubs (and checking that this all works with the way our runtime server is set up); a client-side example of the flag is sketched below.
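For illustration, this is roughly how a client opts into that behavior. The gRPC health-checking stub is used purely as a stand-in for a generated caikit stub, and the address and timeout are assumptions.

```python
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

# Any generated stub works the same way; the health stub is just a stand-in.
channel = grpc.insecure_channel("localhost:8085")  # address is an assumption
stub = health_pb2_grpc.HealthStub(channel)

# wait_for_ready makes gRPC queue the call on the client side until the
# channel is READY (or the deadline expires) instead of failing fast.
response = stub.Check(
    health_pb2.HealthCheckRequest(service=""),
    wait_for_ready=True,
    timeout=30.0,
)
```

Note that this only controls client behavior while the channel is not ready; it does not by itself bound the server-side work queue.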
Additional context
We currently use a basic ThreadPoolExecutor to handle service requests; presumably its internal queue holds all of the yet-to-be-processed requests, but I'm not 100% sure about that. I don't know whether you can set limits on that queue, or specify the behavior to use when it is full. There might be another standard thread pool that does have these features; it would be great to avoid writing our own if possible. (I haven't dug too deeply into this.) A rough sketch of bounding the executor's queue is below.
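For what it's worth, a minimal sketch of bounding the stdlib executor by overriding submit(). It pokes at the undocumented _work_queue attribute, so treat it as an assumption that this stays stable across Python versions, and the gRPC layer would still need to translate the rejection into a RESOURCE_EXHAUSTED status (which is part of why the server-level limit sketched above may be the easier path).

```python
from concurrent.futures import ThreadPoolExecutor


class BoundedThreadPoolExecutor(ThreadPoolExecutor):
    """ThreadPoolExecutor that refuses new work once its queue is too deep.

    Relies on the undocumented _work_queue attribute of the stdlib executor,
    so this is a sketch rather than a guaranteed-stable solution.
    """

    def __init__(self, *args, max_queue_size: int = 100, **kwargs):
        super().__init__(*args, **kwargs)
        self._max_queue_size = max_queue_size

    def submit(self, fn, /, *args, **kwargs):
        # qsize() is approximate, but close enough for back-pressure purposes.
        if self._work_queue.qsize() >= self._max_queue_size:
            raise RuntimeError("thread pool work queue is full")
        return super().submit(fn, *args, **kwargs)
```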