aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost-effective, natively integrated into PyTorch and TensorFlow, and integrated with your favorite AWS services.
https://aws.amazon.com/machine-learning/neuron/

why are the neuron cores locked to a process? #468

Closed · dingusagar closed 1 year ago

dingusagar commented 1 year ago

This is my current setup and scenario: four models, each served from its own Docker container on an Inferentia instance, with each container tied to a single NeuronCore.
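
For context, per-container pinning like this is typically done through the Neuron runtime's `NEURON_RT_VISIBLE_CORES` variable. A minimal sketch, assuming torch-neuron on Inferentia; the model path and core index are placeholders, not part of the original setup:

```python
import os

# Assumption: each of the four containers sets a different core index (0-3)
# at launch. Setting the variable before the Neuron runtime initializes
# pins this process to that single core.
os.environ["NEURON_RT_VISIBLE_CORES"] = "0"

import torch
import torch_neuron  # noqa: F401  (registers the Neuron backend with TorchScript)

# Placeholder path to a model previously compiled with torch.neuron.trace.
model = torch.jit.load("model_neuron.pt")
output = model(torch.rand(1, 3, 224, 224))  # example input only
```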

Questions / Doubts:

  1. When there is heavy load (a high number of requests) on the first two models and no load on the last two, I want the load distributed across all NeuronCores, since some are sitting idle. But in my current setup each Docker container is tied to a single core, so I cannot get a parallelism of 4. Is there a better configuration for using all cores efficiently? (See the sketch after this list.)
  2. I want to understand why a process needs to lock onto a core. It would be better if a process acquired an available core, ran the inference, and released it so that other processes could use it, similar to CPU time slicing.
  3. A process may use only 25% of a NeuronCore, in which case it doesn't make sense for that single process to hold onto the core permanently.
  4. Any recommendations on how to independently scale the parallelism of the models/containers? With the current setup, I can only have four containers, each running one model.
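
Regarding questions 1 and 4: one documented way to drive every visible core from a single process is `torch.neuron.DataParallel`, which replicates one compiled model across the NeuronCores and shards batches across the replicas. A minimal sketch, with placeholder paths and shapes:

```python
import torch
import torch_neuron  # noqa: F401  (registers the Neuron backend with TorchScript)

# Placeholder path to a model previously compiled with torch.neuron.trace.
model = torch.jit.load("model_neuron.pt")

# Replicate the model onto every NeuronCore visible to this process and
# split incoming batches across the replicas along dim 0 (the default).
model_parallel = torch.neuron.DataParallel(model)

batch = torch.rand(4, 3, 224, 224)  # example batched input
output = model_parallel(batch)
```

Note this helps when the replicas are all the same model; it does not by itself let four different models share cores dynamically, which is what questions 2 and 3 ask about.
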
dingusagar commented 1 year ago

Any update on this? Because of the above concerns, we had to temporarily put the transition to an Inferentia inference server on hold.

aws-taylor commented 1 year ago

Hello @dingusagar,

We are in the process of drafting better documentation to assist in situations like this. The simple answer to your questions is that the Neuron SDK is designed to sit at a lower abstraction layer, underneath multi-model inference frameworks.

Since you mention Docker containers: a common technique is to use nginx or HAProxy as a load-balancing and routing layer in front of your model containers. Usually this is done in conjunction with one of the Docker orchestration frameworks, such as Compose, Swarm, or Kubernetes. For non-Docker or inter-container situations, there are tools such as https://github.com/awslabs/multi-model-server that can be used.
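
For illustration, a minimal nginx sketch of such a routing layer; the upstream hostnames and ports are hypothetical container endpoints, not anything prescribed by the Neuron SDK:

```nginx
# Hypothetical: two replicas of the same model server, each pinned to its
# own NeuronCore via NEURON_RT_VISIBLE_CORES at container launch.
upstream model_a {
    least_conn;                 # send each request to the least-busy replica
    server model-a-core0:8080;
    server model-a-core1:8080;
}

server {
    listen 80;
    location /predictions/model-a {
        proxy_pass http://model_a;
    }
}
```

Scaling a hot model then becomes a matter of adding replicas to its upstream block (or letting the orchestrator do so), rather than re-partitioning cores inside a single process.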

awsrjh commented 1 year ago

Closing