
TorchServe: serving more models than the number of NeuronCores #441

Closed: RobinFrcd closed this issue 2 years ago

RobinFrcd commented 2 years ago

Hi, I'm currently hosting a TorchServe server on ECS with Inferentia instances.

I have a lot of models, but they never run at the same time. Is there a way not to block one NeuronCore per model? I'm currently running with NEURON_RT_NUM_CORES=1 on an instance with 4 NeuronCores, but I can't run more than 4 TorchServe models.
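For reference, a minimal sketch of how that one-core-per-worker setup is typically launched; the model store path and archive name below are illustrative, not taken from this thread:

```sh
# Every TorchServe worker inherits this environment, so each worker
# binds exactly one NeuronCore; on a 4-core Inferentia instance that
# caps the server at 4 workers (and hence 4 single-worker models).
export NEURON_RT_NUM_CORES=1
torchserve --start --ncs \
  --model-store /home/model-server/model-store \
  --models my_model.mar
```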

On a single GPU, it's possible to serve multiple models as long as the VRAM permits it. Is it possible to do the same with Inferentia nodes?

Thanks

jluntamazon commented 2 years ago

Hello! From your description, you already seem to be using the best possible environment/worker configuration.

Is there a way not to block one NeuronCore per model ?

With TorchServe and Neuron, the number of worker processes cannot exceed the number of NeuronCores.

This is due to a number of constraints when using TorchServe with Neuron: each TorchServe worker is a separate process, and the Neuron runtime dedicates NeuronCores to a process rather than sharing them between processes.

On a single GPU, it's possible to serve multiple models as long as the VRAM permits it. Is it possible to do the same with inferentia nodes ?

Given that you are attempting to serve many models, it is possible to load multiple different models onto a single NeuronCore from each worker process. This would require loading multiple models in your handler class and then selecting the correct model based on the request input data. Keep in mind that only one model can execute on a NeuronCore at a time, so serving multiple models per NeuronCore will incur a small swapping penalty.
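A rough sketch of such a handler, assuming the models were compiled with torch_neuron and saved as TorchScript files packaged in the model archive; the model names, file names, and the `model_name`/`inputs` request fields are hypothetical:

```python
import json

import torch
import torch_neuron  # noqa: F401  registers the Neuron runtime with TorchScript

from ts.torch_handler.base_handler import BaseHandler


class MultiModelNeuronHandler(BaseHandler):
    """Keeps several Neuron-compiled models on the one NeuronCore owned
    by this worker and routes each request to the model it names."""

    # Hypothetical model files shipped inside the .mar archive
    MODEL_FILES = {
        "resnet": "resnet_neuron.pt",
        "bert": "bert_neuron.pt",
    }

    def initialize(self, context):
        model_dir = context.system_properties.get("model_dir")
        # All models are loaded onto the single NeuronCore visible to
        # this worker (NEURON_RT_NUM_CORES=1).
        self.models = {
            name: torch.jit.load(f"{model_dir}/{filename}")
            for name, filename in self.MODEL_FILES.items()
        }
        self.initialized = True

    def handle(self, data, context):
        results = []
        for row in data:
            body = row.get("body") or row.get("data")
            if isinstance(body, (bytes, bytearray)):
                body = json.loads(body)
            model = self.models[body["model_name"]]  # route on the request
            inputs = torch.tensor(body["inputs"])
            with torch.no_grad():
                # Only one model executes on the core at a time, so
                # switching between models pays a small swap penalty.
                output = model(inputs)
            results.append(output.tolist())
        return results
```

Routing inside the worker keeps every model on the single core that worker owns, trading a per-request swap cost for not reserving a dedicated NeuronCore per model; that trade-off is usually acceptable when, as described above, the models are rarely active at the same time.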

RobinFrcd commented 2 years ago

Thank you very much for the very clear answer!