Kuntal-G opened this issue 4 years ago
@Kuntal-G
How does MXNet Model Server handle multiple concurrent/parallel requests for a particular model?
MMS queues incoming requests for a particular model and serves them from that queue. MMS currently uses Netty for HTTP request/response handling. I am not sure what else you were looking for in this question.
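Conceptually, the per-model queuing described above looks something like this (a toy sketch, not MMS's actual code; the class and method names here are made up):

```python
import queue
import threading

class ModelEndpoint:
    """Hypothetical per-model endpoint: the HTTP frontend enqueues each
    incoming request, and a worker thread drains the queue and runs
    inference, so concurrent requests to one model are serialized."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler          # the model's inference function
        self.requests = queue.Queue()   # per-model request queue
        self.worker = threading.Thread(target=self._serve, daemon=True)
        self.worker.start()

    def submit(self, payload):
        """Called by the HTTP frontend for each incoming request."""
        result = queue.Queue(maxsize=1)
        self.requests.put((payload, result))
        return result.get()             # block until the worker responds

    def _serve(self):
        while True:
            payload, result = self.requests.get()
            result.put(self.handler(payload))

endpoint = ModelEndpoint("demo-model", lambda x: x * 2)
print(endpoint.submit(21))  # -> 42
```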
When we load a model inside multi-model-server, does it apply forking/multiprocessing to host multiple copies of the same model to improve the throughput/latency?
Yes. If you have preload_model set to true, it uses fork semantics to create new instances of model workers. This works on Unix-based systems.
What is the default value and is there any config to decide how many copies of the model will be spawned?
Currently, the source code itself is the documentation. If you have specific questions about the code, we can probably answer them :) .
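For what it's worth, I believe the worker count can be tuned through MMS's config.properties; the key name below is from memory, so please verify it against the source before relying on it:

```properties
# config.properties -- key names recalled from memory, verify against the MMS source
preload_model=true
default_workers_per_model=4
```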
As per the MXNet parallel-inference doc, the main dispatcher thread is single-threaded. https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet
How does MXNet Model Server handle multiple concurrent/parallel requests for a particular model? When we load a model inside multi-model-server, does it apply forking/multiprocessing to host multiple copies of the same model to improve throughput/latency? If yes, what is the default value, and is there any config to decide how many copies of the model will be spawned?
Also, what does the worker thread actually do? https://github.com/awslabs/multi-model-server/blob/master/mms/model_service_worker.py#L166-L212
Any guides or pointers to the code would be highly appreciated.
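Not an authoritative answer, but from reading model_service_worker.py the worker appears to loop over a socket connection from the frontend: decode a request, run the model's handler, and send back an encoded response. A stripped-down sketch of that loop (the socket path and JSON wire format here are invented for illustration):

```python
import json
import socket

def run_worker(sock_path, handler):
    """Simplified model-worker loop: accept one connection from the
    frontend, then repeatedly decode a request, run inference via the
    handler, and send the encoded result back."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(sock_path)
    server.listen(1)
    conn, _ = server.accept()                      # frontend connects at startup
    with conn:
        while True:
            raw = conn.recv(65536)                 # 1. receive an encoded request
            if not raw:
                break                              # frontend closed the connection
            request = json.loads(raw)              # 2. decode it
            response = handler(request)            # 3. run the model's handler
            conn.send(json.dumps(response).encode())  # 4. send the result back
    server.close()
```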