collabora / WhisperLive

A nearly-live implementation of OpenAI's Whisper.
MIT License

Deploying for 1000+ Users #217

Open orgh0 opened 5 months ago

orgh0 commented 5 months ago

Hi, thanks for the awesome work on this project!

What would be the best way to run this project at scale? I've seen the Docker images you've released; is deploying them with Kubernetes a sustainable solution?

We only need the smallest model, but GPU inference is not an option for us.

Any support would be super helpful.

zoq commented 5 months ago

Great to hear you find the project helpful. To serve multiple users I would suggest looking into batching, which is on the roadmap but not currently supported. Beyond that, you probably want some kind of router/load balancer to forward each user to the correct endpoint.
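As a rough illustration of the router idea, new clients could simply be assigned round-robin to a pool of server replicas. This is a sketch only: the backend addresses below are made-up placeholders, and a real router would track per-backend load (active sessions, CPU usage) rather than cycling blindly.

```python
# Illustrative sketch: round-robin assignment of incoming clients
# across several WhisperLive server replicas. Addresses are placeholders.
import itertools

# Assumed: one server process per replica, each on its own host/port,
# e.g. started with `python run_server.py --port 9090`.
BACKENDS = [
    "ws://10.0.0.1:9090",
    "ws://10.0.0.2:9090",
    "ws://10.0.0.3:9090",
]
_next_backend = itertools.cycle(BACKENDS)

def pick_backend() -> str:
    """Return the endpoint the next client should connect to."""
    return next(_next_backend)
```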

orgh0 commented 5 months ago

@zoq - thanks for the quick response, really appreciate it.

Quick Questions:

  1. Do you have any suggestions on resources for getting started with batching? When you say batching, I assume you mean sending batches of data to the model. I'm trying to understand how that helps scale the model to more users on CPU. Do you suggest combining requests from multiple users into one batch? It's hard to imagine how I could batch requests from a single user, although since real-time audio arrives as a continuous stream, there could be something there (see the sketch after this list).
  2. Is there a particular reference architecture for building live-inference infrastructure at scale that you have in mind when you talk about a router/load balancer together with batching? I'm unclear on how you picture batching and request redirection working together.
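On question 1, here is a minimal sketch of what cross-user micro-batching could look like on CPU: a worker collects pending audio chunks from several users for a short window, then runs the model once on the whole batch. `run_model_on_batch`, the queue layout, and all parameters are illustrative assumptions, not WhisperLive APIs.

```python
# Hypothetical sketch: micro-batching audio chunks from multiple users.
# `run_model_on_batch` is a placeholder; a real version would pad the
# chunks to equal length and invoke the model once on the stacked batch.
import queue
import threading
import time

requests = queue.Queue()  # items: (user_id, audio_chunk)

def run_model_on_batch(chunks):
    # Placeholder standing in for one batched forward pass of the model.
    return [f"<transcript of {len(chunk)} samples>" for chunk in chunks]

def batch_worker(max_batch=8, max_wait_s=0.1):
    while True:
        batch = [requests.get()]            # block until work arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:       # gather more items, briefly
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        users, chunks = zip(*batch)
        for user, text in zip(users, run_model_on_batch(list(chunks))):
            print(user, text)               # in practice: reply on the user's socket

if __name__ == "__main__":
    threading.Thread(target=batch_worker, daemon=True).start()
    for user_id in ("alice", "bob"):
        requests.put((user_id, [0.0] * 16000))  # 1 s of fake 16 kHz audio
    time.sleep(0.5)                             # let the worker drain the queue
```

The batching window (`max_wait_s` here) trades latency for throughput, so for near-live transcription it has to stay small.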