caikit / caikit

Caikit is an AI toolkit that enables users to manage models through a set of developer friendly APIs.
https://caikit.github.io/website/
Apache License 2.0

Add bounded queue support #181

Open joerunde opened 1 year ago

joerunde commented 1 year ago

Is your feature request related to a problem? Please describe.

We had previously assumed that all usages of the caikit runtime would be deployed with https://github.com/kserve/modelmesh, which would handle all performance and scaling concerns. This is becoming less true as the caikit runtime is used for serverless deployments and for running LLMs, which model-mesh is not exactly designed to handle.

As a result of this assumption, the caikit runtime has unbounded work queues in the grpc server. This is not ideal when you're running a high load of requests against a caikit runtime cluster: you don't want lots of requests sitting in the server queues while you're trying to scale up.

Describe the solution you'd like

It should be a relatively simple change to add a configuration option that sets a maximum queue size for the server and returns a RESOURCE_EXHAUSTED error when the queue is full.
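For what it's worth, grpc's Python server already has a knob that's close to this: grpc.server() accepts a maximum_concurrent_rpcs argument, and once that many RPCs are being serviced (queued or running) the server rejects new ones with RESOURCE_EXHAUSTED. A minimal sketch of wiring that to a config value; the function name and config plumbing below are made-up placeholders, not actual caikit runtime code:

from concurrent.futures import ThreadPoolExecutor
from typing import Optional

import grpc


def build_grpc_server(num_workers: int = 8,
                      max_concurrent_rpcs: Optional[int] = 64) -> grpc.Server:
    """Build a gRPC server whose total queued + in-flight RPCs are capped.

    `max_concurrent_rpcs` would come from a (hypothetical) runtime config
    option; passing None keeps today's unbounded behavior.
    """
    return grpc.server(
        ThreadPoolExecutor(max_workers=num_workers),
        # gRPC returns RESOURCE_EXHAUSTED to callers once this many RPCs
        # are already in progress, instead of queueing them without bound.
        maximum_concurrent_rpcs=max_concurrent_rpcs,
    )

The one caveat is that this caps queued plus in-flight RPCs together rather than the queue depth alone, but for backpressure purposes that's probably what we want anyway.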

Describe alternatives you've considered

Additional context

We currently use a basic ThreadPoolExecutor to handle service requests; presumably its internal queue is holding all of the yet-to-be-processed requests, but I'm not 100% sure of that. I don't know whether you can set limits on that queue or configure what happens when it's full. There might be another standard thread pool implementation that does have these features; it would be great to avoid writing our own if possible. (I haven't dug too deeply into this.)
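In case it helps: the stock ThreadPoolExecutor doesn't expose a way to bound its internal work queue or customize what happens when it's full, but a thin subclass that tracks how much submitted-but-unfinished work exists can get most of the way there. Rough sketch only, with made-up names (QueueFullError, BoundedThreadPoolExecutor) rather than anything that exists in caikit today:

import threading
from concurrent.futures import ThreadPoolExecutor


class QueueFullError(RuntimeError):
    """Raised when the executor already has too much outstanding work."""


class BoundedThreadPoolExecutor(ThreadPoolExecutor):
    """ThreadPoolExecutor that rejects submissions past a pending-work limit."""

    def __init__(self, max_pending: int, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._max_pending = max_pending
        self._pending = 0  # tasks submitted but not yet finished
        self._pending_lock = threading.Lock()

    def submit(self, fn, *args, **kwargs):
        with self._pending_lock:
            if self._pending >= self._max_pending:
                raise QueueFullError("executor work queue is full")
            self._pending += 1

        def _tracked(*a, **kw):
            try:
                return fn(*a, **kw)
            finally:
                with self._pending_lock:
                    self._pending -= 1

        try:
            return super().submit(_tracked, *args, **kwargs)
        except BaseException:
            # e.g. the pool was already shut down; undo the bookkeeping
            with self._pending_lock:
                self._pending -= 1
            raise

The servicer (or an interceptor) could then translate QueueFullError into context.abort(grpc.StatusCode.RESOURCE_EXHAUSTED, ...). The wrinkle is that the gRPC server submits work to the pool internally, so an exception escaping submit() would need careful handling there; the maximum_concurrent_rpcs route above is probably the simpler path.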

ishaan-jaff commented 7 months ago

Hi @joerunde, I'm the maintainer of LiteLLM. We implemented a request queue for making LLM API calls (to any LLM).

Here's a quick start on this:

docs: https://docs.litellm.ai/docs/routing#queuing-beta

Quick Start

  1. Add Redis credentials in a .env file
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
  2. Start litellm server with your model config
$ litellm --config /path/to/config.yaml --use_queue

Here's an example config for gpt-3.5-turbo

config.yaml (This will load balance between OpenAI + Azure endpoints)

model_list: 
  - model_name: gpt-3.5-turbo
    litellm_params: 
      model: gpt-3.5-turbo
      api_key: 
  - model_name: gpt-3.5-turbo
    litellm_params: 
      model: azure/chatgpt-v-2 # actual model name
      api_key: 
      api_version: 2023-07-01-preview
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
  3. Test (in another window) → sends 100 simultaneous requests to the queue
$ litellm --test_async --num_requests 100

Available Endpoints