aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost-effective, natively integrated into PyTorch and TensorFlow, and integrated with your favorite AWS services.
https://aws.amazon.com/machine-learning/neuron/

why are the neuron cores locked to a process? #468

Closed · dingusagar closed 1 year ago

dingusagar commented 1 year ago

This is my current setup and scenario: four models, each served from its own Docker container on an Inferentia instance, with each container tied to a single NeuronCore.
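
For context, per-container pinning like this is typically done through the Neuron runtime's `NEURON_RT_VISIBLE_CORES` variable. A minimal sketch, assuming torch-neuron on Inferentia; the model path and core index are placeholders, not part of the original setup:

```python
import os

# Assumption: each of the four containers sets a different core index (0-3)
# at launch. Setting the variable before the Neuron runtime initializes
# pins this process to that single core.
os.environ["NEURON_RT_VISIBLE_CORES"] = "0"

import torch
import torch_neuron  # noqa: F401  (registers the Neuron backend with TorchScript)

# Placeholder path to a model previously compiled with torch.neuron.trace.
model = torch.jit.load("model_neuron.pt")
output = model(torch.rand(1, 3, 224, 224))  # example input only
```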

Questions / Doubts:

  1. When there is heavy load (a high number of requests) on the first two models and no load on the last two, I want the load distributed across all NeuronCores, since some are sitting idle. But in my current setup each Docker container is tied to a single core, so I cannot get a parallelism of 4. Is there a better configuration for using all cores efficiently? (See the sketch after this list.)
  2. I want to understand why a process needs to lock onto a core. It would be better if a process acquired an available core, ran the inference, and released it so that other processes could use it, similar to CPU time slicing.
  3. A process may use only 25% of a NeuronCore, in which case it doesn't make sense for that single process to hold onto the core permanently.
  4. Any recommendations on how to independently scale the parallelism of the models/containers? With the current setup, I can only have four containers, each running one model.
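
Regarding questions 1 and 4: one documented way to drive every visible core from a single process is `torch.neuron.DataParallel`, which replicates one compiled model across the NeuronCores and shards batches across the replicas. A minimal sketch, with placeholder paths and shapes:

```python
import torch
import torch_neuron  # noqa: F401  (registers the Neuron backend with TorchScript)

# Placeholder path to a model previously compiled with torch.neuron.trace.
model = torch.jit.load("model_neuron.pt")

# Replicate the model onto every NeuronCore visible to this process and
# split incoming batches across the replicas along dim 0 (the default).
model_parallel = torch.neuron.DataParallel(model)

batch = torch.rand(4, 3, 224, 224)  # example batched input
output = model_parallel(batch)
```

Note this helps when the replicas are all the same model; it does not by itself let four different models share cores dynamically, which is what questions 2 and 3 ask about.
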
dingusagar commented 1 year ago

Any update on this? Because of the above concerns, we had to temporarily put the transition to an Inferentia inference server on hold.

aws-taylor commented 1 year ago

Hello @dingusagar,

We are in the process of drafting better documentation to assist in situations like this. The simple answer to your questions is that the Neuron SDK is designed to sit at a lower abstraction layer, underneath multi-model inference frameworks.

Since you mention Docker containers: a common technique is to use nginx or HAProxy as a load-balancing and routing layer in front of your model containers. Usually this is done in conjunction with one of the Docker orchestration frameworks, such as Compose, Swarm, or Kubernetes. For non-Docker or inter-container situations, there are tools such as https://github.com/awslabs/multi-model-server that can be used.
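
For illustration, a minimal nginx sketch of such a routing layer; the upstream hostnames and ports are hypothetical container endpoints, not anything prescribed by the Neuron SDK:

```nginx
# Hypothetical: two replicas of the same model server, each pinned to its
# own NeuronCore via NEURON_RT_VISIBLE_CORES at container launch.
upstream model_a {
    least_conn;                 # send each request to the least-busy replica
    server model-a-core0:8080;
    server model-a-core1:8080;
}

server {
    listen 80;
    location /predictions/model-a {
        proxy_pass http://model_a;
    }
}
```

Scaling a hot model then becomes a matter of adding replicas to its upstream block (or letting the orchestrator do so), rather than re-partitioning cores inside a single process.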

awsrjh commented 1 year ago

Closing