SeldonIO / MLServer

An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
https://mlserver.readthedocs.io/en/latest/
Apache License 2.0

Dynamic change of batch size #932

Open saeid93 opened 1 year ago

saeid93 commented 1 year ago

In some cases we need to be able to change some of the configuration of a deployed model, like the batch size, on the fly without reloading the model. I think this could be implemented by adding an endpoint that changes the model settings' values.
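For example, a call against such an endpoint might look roughly like this (the path and payload are purely hypothetical; nothing like it exists in MLServer today):

```python
import requests

# Purely hypothetical: MLServer does not expose such an endpoint today.
# The idea would be to change a safe subset of a model's settings
# (e.g. the adaptive batching parameters) without a full reload.
resp = requests.patch(
    "http://localhost:8080/v2/models/my-model/settings",  # assumed path
    json={"max_batch_size": 64, "max_batch_time": 0.5},   # batching-related settings only
)
resp.raise_for_status()
```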

adriangonz commented 1 year ago

Hey @saeid93 ,

A big chunk of logic (like the adaptive batcher) currently assumes that it only needs to be triggered when a model is loaded / unloaded / reloaded, so this wouldn't be a trivial change. We would also need to be careful about which options can be changed on the fly, as we wouldn't want a user modifying the model's name, version or runtime.

Is there any reason why reloading the model wouldn't work in this case?

saeid93 commented 1 year ago

Hey @adriangonz , The main issue is that reloading the model imposes downtime for changes that technically don't require a reload. E.g. when the model itself changes, a reload is necessary, but some settings, like the batching variables, could be changed on the fly.

adriangonz commented 1 year ago

As far as I know, model reloading should happen gracefully. As in, it won't replace the model (i.e. unload the old one) until the new version is up and ready. That was, at least, the intention.

Have you noticed downtimes when reloading models?

gawsoftpl commented 1 year ago

I think this kind of logic is better handled by Kubernetes. You can write a microservice that observes the model's traffic or stats and changes the MLSERVER_MODEL_MAX_BATCH_SIZE env var on the pod; Kubernetes will then create a new pod and shut down the old one. The ideal option would be to build a dedicated KEDA plugin for this.
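A rough sketch of that kind of controller, assuming the model is served by a Deployment whose MLServer container is named `mlserver` (deployment, namespace and container names are placeholders):

```python
from kubernetes import client, config

# Sketch of the microservice described above: after observing the model's
# traffic/stats elsewhere, bump MLSERVER_MODEL_MAX_BATCH_SIZE on the Deployment
# so Kubernetes rolls out a new pod with the new batch size.
config.load_incluster_config()  # or config.load_kube_config() outside the cluster
apps = client.AppsV1Api()


def set_max_batch_size(deployment: str, namespace: str, new_size: int) -> None:
    # Strategic merge patch: only the named container's env var is updated.
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "mlserver",  # container name is an assumption
                            "env": [
                                {
                                    "name": "MLSERVER_MODEL_MAX_BATCH_SIZE",
                                    "value": str(new_size),
                                }
                            ],
                        }
                    ]
                }
            }
        }
    }
    apps.patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)


set_max_batch_size("my-mlserver", "default", 64)
```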

adriangonz commented 1 year ago

Hey @gawsoftpl,

That would totally be the way to handle this in single-model serving scenarios. However, in multi-model serving scenarios, the server becomes a stateful component that manages the model's lifecycle itself.

Having said that, the approach in this case should be similar. That is, you just update your settings and "spin up a new model" (i.e. by sending a new /load request to MLServer), which should reload the model gracefully within the same MLServer pod.
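As an illustration, the reload could be triggered with something like the following (host, port and model name are placeholders; the path assumes the V2 model repository extension exposed by MLServer):

```python
import requests

# Ask MLServer to (re)load a model in-place, which should happen gracefully
# within the same pod: the old model keeps serving until the new one is ready.
model_name = "my-model"
resp = requests.post(f"http://localhost:8080/v2/repository/models/{model_name}/load")
resp.raise_for_status()
```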

adriangonz commented 1 year ago

Hey @saeid93 ,

Following up on this one, have you had a chance to check if you can see any downtimes when reloading models? As discussed in https://github.com/SeldonIO/MLServer/issues/932#issuecomment-1378467352, the intention is that model reloading should happen gracefully.

saeid93 commented 1 year ago

Hey @adriangonz , The model can be re-loaded at runtime with minimal disruption using this. However, the problem is that I couldn't find any way to modify the model settings through that REST interface in a containerized MLServer, and my original intention is to reload the model with a new batch size. As a hack, I even exec'd into the container and changed model-settings.json with the new batch size, but the model seems to get reloaded with the original settings. Is there a way to inject new settings through the /load REST request?
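Roughly, the workaround looked like this (pod name and settings path are placeholders; the final step is the model repository /load request discussed above):

```python
import json
import subprocess

import requests

# Rewrite model-settings.json inside the running pod with a new batch size,
# then ask MLServer to reload the model.
pod = "my-mlserver-pod"  # placeholder pod name
settings_path = "/mnt/models/my-model/model-settings.json"  # placeholder path

# Read the current settings out of the container and bump max_batch_size.
raw = subprocess.check_output(["kubectl", "exec", pod, "--", "cat", settings_path])
settings = json.loads(raw)
settings["max_batch_size"] = 64

# Write the updated settings back into the container.
subprocess.run(
    ["kubectl", "exec", "-i", pod, "--", "sh", "-c", f"cat > {settings_path}"],
    input=json.dumps(settings).encode(),
    check=True,
)

# Trigger a reload; in practice the model appeared to come back with the
# original settings rather than the edited file.
requests.post("http://localhost:8080/v2/repository/models/my-model/load").raise_for_status()
```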