bentoml / BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
https://bentoml.com
Apache License 2.0

Serving multiple models in the same service (or at least docker image) #981

Closed jondoering closed 4 years ago

jondoering commented 4 years ago

Is your feature request related to a problem? Please describe. Many production setups serve more than one model, or even a handful of them. Given that the containerized service is quite heavy (300 MB - 1 GB), it would be great to have support for multiple models in one container, to support use cases with context-based models (e.g. different pages hit different model endpoints).

Describe the solution you'd like That one container/service has the ability to serve multiple different models, e.g. via different endpoints. The model artifacts themselves could, for example, be cached for fast access.

Describe alternatives you've considered Having one container per model - this can become quite heavy for multi-model setups where not every model is used all the time.

Here is an example of an ML server that can load multiple models (I have not looked into it more deeply though): https://github.com/awslabs/multi-model-server/blob/master/docs/server.md

parano commented 4 years ago

@jondoering BentoML does support multiple models in the same service and in the same container, here is related documentation: https://docs.bentoml.org/en/latest/concepts.html#packaging-model-artifacts

And here is an example of a Service exposing multiple API endpoints: https://docs.bentoml.org/en/latest/concepts.html#service-with-multiple-apis
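For reference, here is a minimal sketch of that pattern using the BentoML 0.x API (import paths and decorator arguments differ slightly between 0.x releases, and the two sklearn models and endpoint names are just placeholders, not part of the docs example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from bentoml import BentoService, api, artifacts, env
from bentoml.adapters import DataframeInput
from bentoml.frameworks.sklearn import SklearnModelArtifact


@env(infer_pip_packages=True)
@artifacts([
    SklearnModelArtifact("model_a"),
    SklearnModelArtifact("model_b"),
])
class MultiModelService(BentoService):
    # Two independent models packaged in one BentoService, each exposed on its
    # own endpoint (served as /predict_a and /predict_b by the API server).

    @api(input=DataframeInput(), batch=True)
    def predict_a(self, df):
        return self.artifacts.model_a.predict(df)

    @api(input=DataframeInput(), batch=True)
    def predict_b(self, df):
        return self.artifacts.model_b.predict(df)


if __name__ == "__main__":
    # Placeholder models; any two independently trained models would work here.
    X, y = load_iris(return_X_y=True)
    model_a = LogisticRegression(max_iter=1000).fit(X, y)
    model_b = RandomForestClassifier().fit(X, y)

    # Pack both trained models into a single saved bundle / docker image.
    svc = MultiModelService()
    svc.pack("model_a", model_a)
    svc.pack("model_b", model_b)
    print(svc.save())
```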

jondoering commented 4 years ago

Thanks for the fast reply @parano. I read about it, but my understanding was that this mainly applies when multiple models are chained: "For most model serving scenarios, we recommend one model per prediction service, and decouple non-related models into separate services. The only exception is when multiple models are depending on each other, such as the example above." (from the docs)

The use case I want to tackle is: let's say you have 10 independent models for 10 different scenarios, but each one only applies 1/10 of the time. Currently, the recommendation would be to run 10 different services, even though each one is idle 9/10 of the time (while still consuming memory etc. in the cluster).

In such a case, would the approach from the docs still be recommended?

parano commented 4 years ago

@jondoering got it, great question! If you are already using something like Kubernetes or Knative to schedule the BentoML API server docker container, then yes, we'd still recommend packaging those models separately into their own containers. Frameworks like Knative help utilize hardware resources more efficiently when model servers sit idle (e.g. by scaling them down), and the isolation between models helps improve the overall stability of your prediction services (say one model has a memory leak issue; it will not affect the other models).

But if putting multiple models in one container drastically simplifies your deployment workflow, I think it is OK to just put them in one service served from one container. For production deployment, you might want to stress test your API server to make sure it does not break under heavy traffic and that the system resources are enough to meet your requirements.
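For the stress-testing part, a dedicated load-testing tool is the better option, but a minimal concurrency smoke test in Python could look like the sketch below (the port, endpoint path and payload are assumptions; adjust them to match your own service's API names and input adapters):

```python
import concurrent.futures
import time

import requests

# Assumed local BentoML API server endpoint and payload; replace with your own.
URL = "http://localhost:5000/predict_a"
PAYLOAD = [[5.1, 3.5, 1.4, 0.2]]


def call_once(_):
    # Send one prediction request and record its latency.
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=10)
    return resp.status_code, time.perf_counter() - start


# Fire 1000 requests with 50 concurrent workers.
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(call_once, range(1000)))

errors = sum(1 for status, _ in results if status != 200)
latencies = sorted(latency for _, latency in results)
print(f"errors: {errors}/{len(results)}")
print(f"p95 latency: {latencies[int(0.95 * len(latencies))]:.3f}s")
```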

parano commented 4 years ago

Closing the issue for now - @jondoering feel free to post to the Discussions page if you have any follow-up questions or want to seek advice from the community: https://github.com/bentoml/BentoML/discussions

jondoering commented 4 years ago

Thanks @parano

goodrahstar commented 4 years ago

Hi team, do we have any example of how to run all the saved models at once using the BentoML CLI?

I have the following models saved; now I want to load them all into memory at once and serve them under the same API request.

[Screenshot of the saved models]

yubozhao commented 4 years ago

@goodrahstar Right now, there is no command to run all of the saved Bentos at once with the BentoML CLI. You can use a bash script to loop over each saved BentoService and run bentoml run to compare results on the same dataset.
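A rough sketch of that loop, driven from Python rather than bash (the service names, API name and input here are placeholders, and the exact bentoml run flags can differ between BentoML 0.x releases):

```python
import subprocess

# Placeholder name:version tags; replace with the entries shown by `bentoml list`.
saved_services = ["ModelServiceA:latest", "ModelServiceB:latest"]
test_input = "[[5.1, 3.5, 1.4, 0.2]]"

for svc in saved_services:
    # Run the same test input against each saved BentoService via the CLI
    # so the outputs can be compared side by side.
    result = subprocess.run(
        ["bentoml", "run", svc, "predict", "--input", test_input],
        capture_output=True,
        text=True,
    )
    print(svc, "->", result.stdout.strip() or result.stderr.strip())
```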

I want to understand your second question better. I am assuming you want to compare prediction results from each iteration when an HTTP request comes in? There is a good discussion about a similar operation (shadowing) here: https://github.com/bentoml/BentoML/discussions/1051

Feel free to open a discussion about this topic on https://github.com/bentoml/BentoML/discussions. I would love to hear what the community has to say.