aws-neuron / aws-neuron-sagemaker-samples


Can't use deployed Neuron model with error NRT:nrt_allocate_neuron_cores #9

Open marina-pchelina opened 1 month ago

marina-pchelina commented 1 month ago

Hi, I'm following the sample here to try to compile a model for Neuron and deploy it on SageMaker.

Following the steps in the sample exactly, I am able to deploy the model, but when I try to invoke it I get a 500 error, and the traceback in CloudWatch shows the following:

[Screenshot: CloudWatch traceback showing the NRT:nrt_allocate_neuron_cores error]

I have only seen this error before when trying to use a second model on the same instance while another one is running, but that should not be the case here.

jluntamazon commented 1 month ago

Hi @marina-pchelina

Since NeuronCores are reserved per process, it's possible that you have an old process which is holding onto the NeuronCores but has not been properly terminated. One thing to try is to forcefully stop all running processes: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/training-troubleshooting.html#neuroncore-s-not-available-requested-1-available-0
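
As a rough illustration only (the linked troubleshooting page describes the recommended procedure, and this assumes ps is available in the container), stopping stale Python processes that might be holding the NeuronCores could look like this:

import os
import signal
import subprocess

# Rough sketch: find other Python processes that may be holding the NeuronCores
# and forcefully stop them so the cores are released (cores are reserved per process).
ps_out = subprocess.run(['ps', '-eo', 'pid,comm'], stdout=subprocess.PIPE).stdout.decode('utf-8')
for line in ps_out.splitlines()[1:]:
    pid, comm = line.split(None, 1)
    if 'python' in comm and int(pid) != os.getpid():
        os.kill(int(pid), signal.SIGKILL)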

marina-pchelina commented 1 month ago

Hi, thanks for getting back to me! What I don't understand is how an old process could be holding onto the NeuronCores if a new inference instance is initialized each time I deploy. In any case, I tried including some commands from the troubleshooting doc, plus some others I found, at the top of my inference.py script like so:

import subprocess

# Diagnostics suggested in the Neuron troubleshooting doc
commands = [
    'apt-get install kmod',
    'lsmod | grep neuron',
    'ps aux | grep python',
    'neuron-ls',
    'modinfo neuron',
]
for cmd in commands:
    print(f"Running {cmd}")
    print(subprocess.run(cmd, stdout=subprocess.PIPE, shell=True).stdout.decode('utf-8'))

I can see the two NeuronCores are there with neuron-ls:

[Screenshot: neuron-ls output showing both NeuronCores]

There are no significant Python processes that could be holding the cores, and killing them all explicitly didn't help either.

[Screenshot: ps aux | grep python output]

However, it seems like I'm not able to use lsmod or modinfo, even though I can run them and get output from inside an EC2 instance (same inf2) directly. I tried installing them with apt-get install kmod, but that didn't help either.

[Screenshot: lsmod and modinfo failing inside the SageMaker container]
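
One thing I could try instead, since lsmod only formats /proc/modules and the driver exposes device files when loaded (untested from inside this container):

import glob

# Alternative check that avoids kmod: read the kernel module list directly from
# /proc/modules and look for the Neuron device files under /dev/neuron*.
with open('/proc/modules') as f:
    print([line.strip() for line in f if 'neuron' in line])
print(glob.glob('/dev/neuron*'))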

Could that possibly have something to do with the image that's used in the tutorial? It's currently this one: ecr_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-neuronx:1.13.1-neuronx-py310-sdk2.13.2-ubuntu20.04"

jluntamazon commented 1 month ago

@marina-pchelina, we were able to reproduce the problem in the tutorial and are looking into a fix.

The root cause appears to be 2 separate misconfigurations:

  1. The default number of workers is set to 4. In torchserve, each worker is a separate process which must take control of a single NeuronCore. Since the ml.inf2.xlarge instance has only 2 NeuronCores, this is an invalid number of workers. You can observe the configuration at the beginning of the logs: Default workers per model: 4
  2. The default number of NeuronCores per worker does not appear to be configured, which causes the first process to attempt to take ownership of all NeuronCores. The Neuron runtime allows each process to take control of as many NeuronCores as it needs, and the default behavior is that one process takes ownership of every NeuronCore on the instance. When using process-level workers, each process should therefore be configured with the environment variable NEURON_RT_NUM_CORES=1 so that it only takes ownership of a single NeuronCore for the model that it loads (see the sketch below). You can see this in the logs: warnings are issued for each of the 4 model loads, followed by only 3 nrt_allocate_neuron_cores errors showing that the NeuronCores have already been allocated to another process.

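For reference, a rough sketch of where those two settings could be passed when creating the SageMaker model (following the PyTorchModel flow from the tutorial; model_uri and role are placeholders, and using SAGEMAKER_MODEL_SERVER_WORKERS as the worker-count override is an untested suggestion, not a confirmed fix):

from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    image_uri=ecr_image,          # the neuronx DLC mentioned above
    model_data=model_uri,         # placeholder: S3 path to the compiled model.tar.gz
    role=role,                    # placeholder: SageMaker execution role
    entry_point="inference.py",
    env={
        "NEURON_RT_NUM_CORES": "1",             # each worker process owns a single NeuronCore
        "SAGEMAKER_MODEL_SERVER_WORKERS": "2",  # assumed knob: match the 2 NeuronCores on ml.inf2.xlarge
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.inf2.xlarge")
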
marina-pchelina commented 1 month ago

Thanks for looking into this! I tried to re-compile the model with --target inf2 on the off chance it might help configure the number of workers, but the logs still show Default workers per model: 4. If it's any help, I can deploy and use models through the HuggingFace integration; the problem with that is that I want to use both cores with DataParallel, which the HuggingFace class doesn't seem to allow. Let me know if there's anything I can do myself to work around that; otherwise, I'll wait for a fix.
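
For context, this is roughly the DataParallel usage I'm after (a sketch only; the model filename and input shape are just examples):

import torch
import torch_neuronx

# Wrap the traced Neuron model so input batches are split across both
# NeuronCores on the inf2 instance.
model = torch.jit.load("model_neuron.pt")      # example filename for the traced model
model_parallel = torch_neuronx.DataParallel(model)

batch = torch.rand(8, 3, 224, 224)             # example input; shape depends on the model
output = model_parallel(batch)                 # inputs are sharded across the NeuronCores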