aws / sagemaker-inference-toolkit

Serve machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0

Controlling multiple copies of the same model within SageMaker multi-model-server #40

Closed · Kuntal-G closed this issue 4 years ago

Kuntal-G commented 4 years ago

As per the MXNet inference docs, the main dispatcher thread is single-threaded: https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet

And the SageMaker inference toolkit starts the MXNet multi-model-server: https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L54

When I load a single model in the SageMaker endpoint with multi-model-server, the CloudWatch metric TotalModelCount shows 4 (instead of 1). Similarly, for 2 models, the count increases to 8. Could you please explain the reason?

How does the MXNet model server handle multiple concurrent/parallel requests for a particular model? When we load a model inside multi-model-server, does it apply forking/multiprocessing to host multiple copies of the same model to improve throughput/latency? If yes, what is the default value, and is there a config to decide how many copies of the model will be spawned?

Also, what work does the worker thread actually perform? https://github.com/awslabs/multi-model-server/blob/master/mms/model_service_worker.py#L166-L212

Any guide or pointer would be highly appreciated.

Kuntal-G commented 4 years ago

Also, I'm evaluating the SageMaker MXNet multi-model-server for our production use case.

Is there a way to explicitly unload a particular model from inside the SageMaker-hosted multi-model container? The MXNet multi-model-server provides an unregister-model API (https://github.com/awslabs/multi-model-server/blob/master/docs/management_api.md#register-a-model), but I could not find anything in the SageMaker SDK to do that. And if model unloading is handled internally by SageMaker, on what basis does it unload a model and free space for a new model? What is the unloading criterion as the number of models inside a container increases? And is there any possibility of customizing the unloading logic?
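For reference, a minimal sketch of calling the MMS unregister API directly from Python, per the management_api.md linked above. This is an illustration, not a SageMaker-supported path: it assumes MMS is reachable on its default management port 8081 (which SageMaker-hosted endpoints do not expose; SageMaker manages loading/unloading itself), and "my_model" is a placeholder model name.

```python
# Sketch: unregister a model via the MMS management API.
# Assumes MMS is running locally with the default management port 8081;
# "my_model" is a placeholder model name.
import requests

resp = requests.delete("http://localhost:8081/models/my_model")
resp.raise_for_status()
print(resp.json())  # status message confirming the unregister
```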

ajaykarpur commented 4 years ago

Hi @Kuntal-G, the number of workers per model is set using the SAGEMAKER_MODEL_SERVER_WORKERS environment variable. This is used to configure MMS with the number of workers per model: https://github.com/aws/sagemaker-inference-toolkit/blob/a5b8c4d2e0c47a9f7b537aa137166556fdb7f45a/src/sagemaker_inference/model_server.py#L151
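For example, with the SageMaker Python SDK you can pass the variable through the model's `env` argument. A minimal sketch; the S3 path, role ARN, entry point, instance type, and framework version below are all placeholders:

```python
# Sketch: setting SAGEMAKER_MODEL_SERVER_WORKERS via the SageMaker
# Python SDK. All resource names below are placeholders.
from sagemaker.mxnet import MXNetModel

model = MXNetModel(
    model_data="s3://my-bucket/model.tar.gz",             # placeholder artifact
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    entry_point="inference.py",
    framework_version="1.6.0",
    py_version="py3",
    env={"SAGEMAKER_MODEL_SERVER_WORKERS": "2"},          # two workers per model
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.c5.xlarge")
```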

For more details on this functionality (and other configuration options), you can refer to this MMS documentation: https://github.com/awslabs/multi-model-server/blob/master/docs/configuration.md
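As context for the TotalModelCount question (an inference from the MMS configuration doc linked above, worth verifying against your MMS version): when no worker count is supplied, MMS falls back to its default_workers_per_model setting, which defaults to the number of available GPUs or logical processors on the host. That would explain a count of 4 per model on a 4-vCPU instance. Pinning it explicitly in config.properties might look like:

```properties
# Hypothetical config.properties excerpt; key name per the MMS
# configuration docs linked above.
default_workers_per_model=1
```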

laurenyu commented 4 years ago

Closing due to lack of activity.