awslabs / multi-model-server

Multi Model Server is a tool for serving neural net models for inference
Apache License 2.0
998 stars 230 forks

Controlling multiple copies of same model within multi-model-server #901

Open Kuntal-G opened 4 years ago

Kuntal-G commented 4 years ago

As per the MXNet inference doc, the main dispatcher thread is single-threaded: https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet

How does MXNet Model Server handle multiple concurrent/parallel requests for a particular model? When we load a model inside multi-model-server, does it use forking/multiprocessing to host multiple copies of the same model to improve throughput/latency? If so, what is the default value, and is there a config option that decides how many copies of the model will be spawned?

Also, what work does the worker thread actually perform? https://github.com/awslabs/multi-model-server/blob/master/mms/model_service_worker.py#L166-L212

Any guide and pointer to code will be highly appreciated.

vdantu commented 4 years ago

@Kuntal-G

How does mxnet model server handle multiple concurrent/parallel request for a particular model?

MMS queues incoming requests for a particular model and serves them in order. MMS currently uses Netty for HTTP request/response handling. I am not sure what else you were looking for in this question.
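To make the queuing behavior concrete, here is a toy sketch of the queue-per-model dispatch pattern described above. This is illustrative only, not MMS or Netty code: the class name, the stand-in model function, and the threading details are all invented for the example.

```python
# Toy sketch: concurrent callers for one model are serialized through a queue.
# Not MMS code -- names and structure are invented for illustration.
import queue
import threading

class ModelDispatcher:
    """Queues incoming requests for a single model and serves them in order."""

    def __init__(self, model_fn):
        self.model_fn = model_fn
        self.requests = queue.Queue()
        # One worker thread drains the queue, mimicking a single dispatcher.
        self.worker = threading.Thread(target=self._serve, daemon=True)
        self.worker.start()

    def _serve(self):
        while True:
            payload, reply = self.requests.get()
            reply.put(self.model_fn(payload))  # run inference, hand back result

    def infer(self, payload):
        reply = queue.Queue(maxsize=1)
        self.requests.put((payload, reply))
        return reply.get()  # block until the worker has responded

dispatcher = ModelDispatcher(lambda x: x * 2)  # stand-in "model"
print(dispatcher.infer(21))  # prints 42
```

Many HTTP threads can call `infer` concurrently; the per-model queue guarantees requests reach the model one at a time.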

When we load a model inside multi-model-server, does it apply forking/multiprocessing to host multiple copies of the same model to improve the throughput/latency?

Yes. If you have `preload_model` set to true, MMS uses fork semantics to create new instances of model workers. This works on Unix-based systems.
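A minimal sketch of what fork-based worker spawning looks like in Python, assuming a Unix system. This is not the MMS worker lifecycle (that lives in `model_service_worker.py`); the function names and the placeholder "model" state are invented for the example.

```python
# Sketch of fork-based worker spawning (Unix only). Not MMS code.
import multiprocessing as mp

def model_worker(worker_id, conn):
    # Each forked copy holds its own view of the (preloaded) model state.
    model = {"weights": "preloaded-before-fork"}  # placeholder model
    conn.send((worker_id, model["weights"]))
    conn.close()

def spawn_workers(num_workers):
    # "fork" means children inherit the parent's memory, so a model
    # loaded before forking does not need to be reloaded per worker.
    ctx = mp.get_context("fork")
    results = []
    for wid in range(num_workers):
        parent, child = ctx.Pipe()
        p = ctx.Process(target=model_worker, args=(wid, child))
        p.start()
        results.append(parent.recv())
        p.join()
    return results

if __name__ == "__main__":
    print(spawn_workers(2))
```

The key point is the fork semantics: because the child processes inherit the parent's address space, a model preloaded in the parent is available in every worker copy without a separate load step.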

What is the default value and is there any config to decide how many copies of the model will be spawned?

`default_workers_per_model`
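For reference, both properties mentioned above go in the server's `config.properties`; a hedged sketch, with the values being illustrative rather than defaults:

```properties
# config.properties (values illustrative)
preload_model=true
default_workers_per_model=4
```

Worker counts can also be adjusted per model at runtime through the management API's scale-workers endpoint (e.g. `PUT /models/{model_name}?min_worker=...`), if I recall the API correctly.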

The source code is currently the document itself. If you have specific questions about the code, we can probably answer them :) .