aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Dense / inaccessible stack trace while compiling model #836

Closed sallamander317 closed 1 month ago

sallamander317 commented 4 months ago

I'm getting the following error when trying to compile a model (code snippet below). The file the stack trace points to appears to be accessible only while the model is actively compiling, and dropping an import pdb; pdb.set_trace() into the Python code does not pause execution anywhere that lets me inspect that file. Even if I could, it's not clear where to go from there: the error seems to be caused by something in the underlying model defined in PyTorch, but the trace itself does not point to where that issue is.

(Screenshot: compiler error and stack trace, 2024-02-21)

Code used to try to compile (with private model name removed):

from pathlib import Path

import torch
import torch_neuronx

from my_model_module import MyRetinaNet

fpath_to_my_model = "fpath_to_model"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = MyRetinaNet(fpath_to_my_model, device)
model.eval()
input_tensor = torch.randn(1, 3, 1280, 896)
input_tensor = input_tensor.to(device)
model_neuron = torch_neuronx.trace(model, input_tensor)
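One way to keep the file that the stack trace references around after compilation exits is to pin the compiler working directory instead of letting it default to a temporary one. A minimal sketch, assuming the installed torch_neuronx supports the compiler_workdir keyword of torch_neuronx.trace (the directory name is illustrative):

import torch
import torch_neuronx

from my_model_module import MyRetinaNet

fpath_to_my_model = "fpath_to_model"

model = MyRetinaNet(fpath_to_my_model, torch.device("cpu"))
model.eval()
input_tensor = torch.randn(1, 3, 1280, 896)

# Persist the compiler's intermediate artifacts (including the file the
# stack trace points at) so they can be inspected after compilation fails.
model_neuron = torch_neuronx.trace(
    model,
    input_tensor,
    compiler_workdir="./neuron_compile_workdir",
)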

I'm on Python 3.10.6; here are all the torch libraries in the environment:

(Screenshot: installed torch package versions)

I'm on an Amazon Linux 2 instance.

aws-donkrets commented 4 months ago

Hi @sallamander317, this seems to be a model issue, not a Neuron issue. Have you successfully compiled and executed the model on a CPU (without involving Neuron)? Also, what version of the Neuron SDK are you using?
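If it's useful, one way to capture the installed versions as text is to query the package metadata directly. A minimal sketch; the package names are assumed from a typical torch-neuronx installation:

from importlib.metadata import PackageNotFoundError, version

# Package names assumed from a typical torch-neuronx setup; adjust as needed.
for pkg in ("torch", "torch-xla", "torch-neuronx", "neuronx-cc", "libneuronxla"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")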

sallamander317 commented 4 months ago

Hi @aws-donkrets, can you clarify what you mean by compiled / executed on the CPU?

I think this will answer your question, but I just want to make sure "compile" doesn't refer to any kind of ONNX export and/or torch.compile call. We've successfully trained this model on a GPU, and we've also successfully run inference on CPU and GPU before (but on an AWS device with Neuron cores). I guess this raises the question: is it a requirement that we train our model on a Neuron core in order to run inference on a Neuron core? That wasn't clear to me in the documentation, but I very well could have missed it.

Here are all the versions of the Neuron packages we're using:

(Screenshot: installed Neuron package versions)

aws-donkrets commented 4 months ago

@sallamander317

BTW, it is NOT a requirement to train a model on Neuron in order to run inference on Neuron. In fact, many users train on a GPU and then deploy their model on Neuron to take advantage of the sometimes better price/performance Neuron offers, so you didn't miss anything in the documentation.
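As a rough illustration of that train-on-GPU / deploy-on-Neuron flow, here is a minimal sketch; the model class and file names are placeholders borrowed from the snippet above, and the weights are assumed to load onto CPU before tracing:

import torch
import torch_neuronx

from my_model_module import MyRetinaNet  # placeholder, as in the snippet above

# Build the model on CPU with weights that were trained elsewhere (e.g. on a GPU).
model = MyRetinaNet("fpath_to_model", torch.device("cpu"))
model.eval()

# Compile once for Neuron; the example input stays on CPU during tracing.
example = torch.randn(1, 3, 1280, 896)
model_neuron = torch_neuronx.trace(model, example)

# torch_neuronx.trace returns a TorchScript module, so it can be saved and
# reloaded for inference on the Neuron instance.
torch.jit.save(model_neuron, "my_retinanet_neuron.pt")
loaded = torch.jit.load("my_retinanet_neuron.pt")
output = loaded(example)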

mrnikwaws commented 3 months ago

@sallamander317 Checking back: were you able to test on regular PyTorch / CPU, i.e. just run the model against the tensor without moving it off the CPU or compiling?

import torch
from my_model_module import MyRetinaNet

fpath_to_my_model = "fpath_to_model"
device = torch.device("cpu")  # keep everything on CPU, no Neuron involved
model = MyRetinaNet(fpath_to_my_model, device)
model.eval()
input_tensor = torch.randn(1, 3, 1280, 896)
model(input_tensor)

We'll plan to close this issue if we don't hear back in the next week.

sallamander317 commented 3 months ago

@mrnikwaws I have not been able to test yet, and we're currently holding off on this until we find more time. Feel free to close, and when we get around to testing I can circle back. Appreciate you checking in!

aws-taylor commented 1 month ago

Thanks @sallamander317,

Resolving for now. If you need further support don't hesitate to reach out to us again.