Closed: RobinFrcd closed this issue 2 years ago

Hi, I'm currently hosting a TorchServe server on ECS with Inferentia instances.

I have a lot of models, but they are never running at the same time. Is there a way not to block one NeuronCore per model? I'm currently running with

NEURON_RT_NUM_CORES=1

on an instance with 4 NeuronCores, but can't run more than 4 TorchServe models. On a single GPU, it's possible to serve multiple models as long as the VRAM permits it. Is it possible to do the same with Inferentia nodes?

Thanks
Hello! From your description, you seem to be using the best possible environment/worker configuration already.
> Is there a way not to block one NeuronCore per model?
With TorchServe and Neuron, the number of worker processes cannot exceed the number of NeuronCores. This is due to a number of constraints when using TorchServe/Neuron.
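To make the arithmetic concrete: on an instance with 4 NeuronCores (e.g. an inf1.xlarge) and NEURON_RT_NUM_CORES=1, each worker process claims one core, so a config.properties along these lines (the .mar archive names are placeholders, not from this issue) already saturates the chip:

```properties
# Hypothetical config.properties for a 4-NeuronCore instance.
# Each worker claims one NeuronCore (NEURON_RT_NUM_CORES=1), so four
# models with one worker each use all four cores; a fifth model would
# have no core left to bind to.
load_models=model_a.mar,model_b.mar,model_c.mar,model_d.mar
default_workers_per_model=1
```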
> On a single GPU, it's possible to serve multiple models as long as the VRAM permits it. Is it possible to do the same with Inferentia nodes?
Given that you are attempting to serve many models, it is possible to load multiple different models onto a single NeuronCore from each worker process. This would require loading the models in your handler class and then selecting the correct one based on the request input data. Keep in mind that only one model can execute on a NeuronCore at a time, so serving multiple models per NeuronCore will incur a small swapping penalty.
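As a rough illustration of that pattern, here is a minimal custom-handler sketch; the artifact names model_a.pt/model_b.pt, the model_name request field, and the assumption that each model takes and returns a single tensor from a JSON body are all hypothetical, not from this issue:

```python
import os

import torch
from ts.torch_handler.base_handler import BaseHandler


class MultiModelNeuronHandler(BaseHandler):
    """Serve several Neuron-compiled TorchScript models from one worker."""

    def initialize(self, context):
        # With NEURON_RT_NUM_CORES=1, this worker owns one NeuronCore;
        # every model loaded here shares that single core.
        model_dir = context.system_properties.get("model_dir")
        self.models = {
            name: torch.jit.load(os.path.join(model_dir, f"{name}.pt"))
            for name in ("model_a", "model_b")  # placeholder artifact names
        }
        self.initialized = True

    def handle(self, data, context):
        results = []
        for row in data:
            # TorchServe delivers the request payload under "body" or "data";
            # an application/json body arrives already decoded into a dict.
            body = row.get("body") or row.get("data")
            name = body.get("model_name", "model_a")  # hypothetical field
            tensor = torch.as_tensor(body["input"])
            # Only one model executes on the core at a time, so routing
            # requests across models incurs a small swapping penalty.
            with torch.no_grad():
                results.append(self.models[name](tensor).tolist())
        return results
```

A request would then name its target model, e.g. {"model_name": "model_b", "input": [[0.1, 0.2]]}.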
Thank you very much for the very clear answer!