aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost-effective, natively integrated into PyTorch and TensorFlow, and integrated with your favorite AWS services.
https://aws.amazon.com/machine-learning/neuron/

TorchServe: serve more models than the number of NeuronCores #441

Closed RobinFrcd closed 2 years ago

RobinFrcd commented 2 years ago

Hi, I'm currently hosting a TorchServe server on ECS with Inferentia instances.

I have a lot of models, but they are never running at the same time. Is there a way not to block one NeuronCore per model? I'm currently running with NEURON_RT_NUM_CORES=1 on an instance with 4 NeuronCores, but I can't run more than 4 TorchServe models.

On a single GPU, it's possible to serve multiple models as long as the VRAM permits it. Is it possible to do the same with Inferentia nodes?

Thanks

jluntamazon commented 2 years ago

Hello,

From your description, you seem to be using the best possible environment/worker configuration already.

Is there a way not to block one NeuronCore per model?

With TorchServe & Neuron, the number of worker processes cannot exceed the number of NeuronCores.

This is due to a number of constraints when using TorchServe with Neuron: each TorchServe worker runs as a separate process, each process that uses Neuron needs at least one NeuronCore to load its model, and a NeuronCore cannot be shared between processes.
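As a rough illustration of that per-process core allocation (a minimal sketch, assuming Inf1 with torch-neuron; the file name, worker count, and multiprocessing setup are illustrative and not from this thread):

```python
import os
import multiprocessing as mp

import torch
import torch_neuron  # noqa: F401  # registers the Neuron backend for torch.jit.load


def worker(model_path: str) -> None:
    # Restrict this worker process to a single NeuronCore before the Neuron
    # runtime is initialized (TorchServe workers inherit this setting from
    # the environment in the setup described above).
    os.environ["NEURON_RT_NUM_CORES"] = "1"
    model = torch.jit.load(model_path)  # loading claims one NeuronCore for this process
    # ... serve requests with `model` ...


if __name__ == "__main__":
    # On an instance with 4 NeuronCores, at most 4 such worker processes can
    # hold a core at the same time; a 5th would fail to acquire a core.
    workers = [mp.Process(target=worker, args=("model_neuron.pt",)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```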

On a single GPU, it's possible to serve multiple models as long as the VRAM permits it. Is it possible to do the same with Inferentia nodes?

Given that you are attempting to serve many models, it is possible to load multiple different models onto a single NeuronCore from each worker process. This would require loading multiple models in your handler class and then selecting the correct model based on the request input data. Keep in mind that only one model can be executing on a NeuronCore at a time, so by serving multiple models per NeuronCore you will incur a small swapping penalty.
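A minimal sketch of that handler pattern, assuming Inf1 with torch-neuron, two hypothetical Neuron-traced models packaged into the model archive, and an illustrative "model" field in the request body used for routing (none of these names come from this thread):

```python
import json
import os

import torch
import torch_neuron  # noqa: F401  # registers the Neuron backend for torch.jit.load
from ts.torch_handler.base_handler import BaseHandler


class MultiModelNeuronHandler(BaseHandler):
    """Loads several Neuron-compiled models in one worker (one NeuronCore)
    and routes each request to the model named in the request body."""

    def initialize(self, context):
        # model_dir contains the files packaged into the .mar archive
        model_dir = context.system_properties.get("model_dir")
        self.models = {
            # Hypothetical file names; replace with your own traced models.
            "classifier_a": torch.jit.load(os.path.join(model_dir, "model_a_neuron.pt")),
            "classifier_b": torch.jit.load(os.path.join(model_dir, "model_b_neuron.pt")),
        }
        self.initialized = True

    def handle(self, data, context):
        responses = []
        for request in data:
            body = request.get("body") or request.get("data")
            if isinstance(body, (bytes, bytearray)):
                body = json.loads(body)
            # "model" selects which loaded model to run; "inputs" carries the payload.
            model = self.models[body["model"]]
            inputs = torch.tensor(body["inputs"])
            with torch.no_grad():
                output = model(inputs)
            responses.append(output.tolist())
        return responses
```

With one such worker per NeuronCore, every model stays loaded and callable, and the only extra cost is the swap mentioned above when consecutive requests on the same core target different models.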

RobinFrcd commented 2 years ago

Thank you very much for the very clear answer!