aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Dense / inaccessible stack trace while compiling model #836

Closed sallamander317 closed 1 month ago

sallamander317 commented 4 months ago

I'm getting the following error when trying to compile a model (code snippet below). The file the stack trace points to appears to be accessible only while the model is actively compiling, and dropping an import pdb; pdb.set_trace() into the Python code does not pause execution anywhere that lets me inspect that file. Even if I could, it's not clear where to go from there: the error seems to be caused by something in the underlying model defined in PyTorch, but the trace itself does not point to where that issue is.

(Screenshot: compiler error and stack trace, 2024-02-21)

Code used to try to compile (with private model name removed):

from pathlib import Path

import torch
import torch_neuronx

from my_model_module import MyRetinaNet

fpath_to_my_model = "fpath_to_model"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = MyRetinaNet(fpath_to_my_model, device)
model.eval()
input_tensor = torch.randn(1, 3, 1280, 896)
input_tensor = input_tensor.to(device)
model_neuron = torch_neuronx.trace(model, input_tensor)
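One way to keep the file that the stack trace references around after compilation exits is to pin the compiler working directory instead of letting it default to a temporary one. A minimal sketch, assuming the installed torch_neuronx supports the compiler_workdir keyword of torch_neuronx.trace (the directory name is illustrative):

import torch
import torch_neuronx

from my_model_module import MyRetinaNet

fpath_to_my_model = "fpath_to_model"

model = MyRetinaNet(fpath_to_my_model, torch.device("cpu"))
model.eval()
input_tensor = torch.randn(1, 3, 1280, 896)

# Persist the compiler's intermediate artifacts (including the file the
# stack trace points at) so they can be inspected after compilation fails.
model_neuron = torch_neuronx.trace(
    model,
    input_tensor,
    compiler_workdir="./neuron_compile_workdir",
)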

I'm on Python 3.10.6; here are all the torch libraries in the environment:

(Screenshot: installed torch package versions)

I'm on an Amazon Linux 2 instance.

aws-donkrets commented 4 months ago

Hi @sallamander317, this seems to be a model issue, not a Neuron issue. Have you successfully compiled and executed the model on a CPU (without involving Neuron)? Also, what version of the Neuron SDK are you using?
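If it's useful, one way to capture the installed versions as text is to query the package metadata directly. A minimal sketch; the package names are assumed from a typical torch-neuronx installation:

from importlib.metadata import PackageNotFoundError, version

# Package names assumed from a typical torch-neuronx setup; adjust as needed.
for pkg in ("torch", "torch-xla", "torch-neuronx", "neuronx-cc", "libneuronxla"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")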

sallamander317 commented 4 months ago

Hi @aws-donkrets, can you clarify what you mean by compiled / executed on the CPU?

I think this will answer your question, but I just want to make sure "compile" doesn't refer to any kind of ONNX export and/or torch.compile call. We've successfully trained this model on a GPU, and we've also successfully run inference on CPU and GPU before (but on an AWS device with Neuron cores). I guess this raises the question: is it a requirement that we train our model on a Neuron core in order to run inference on a Neuron core? That wasn't clear to me in the documentation, but I very well could have missed it.

Here are all the versions of the Neuron packages we're using:

(Screenshot: installed Neuron package versions)

aws-donkrets commented 4 months ago

@sallamander317

BTW, it is NOT a requirement to train a model on Neuron in order to run inference on Neuron. In fact, many users train on a GPU and then deploy their model on Neuron to take advantage of the sometimes better price/performance Neuron offers, so you didn't miss anything in the documentation.
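As a rough illustration of that train-on-GPU / deploy-on-Neuron flow, here is a minimal sketch; the model class and file names are placeholders borrowed from the snippet above, and the weights are assumed to load onto CPU before tracing:

import torch
import torch_neuronx

from my_model_module import MyRetinaNet  # placeholder, as in the snippet above

# Build the model on CPU with weights that were trained elsewhere (e.g. on a GPU).
model = MyRetinaNet("fpath_to_model", torch.device("cpu"))
model.eval()

# Compile once for Neuron; the example input stays on CPU during tracing.
example = torch.randn(1, 3, 1280, 896)
model_neuron = torch_neuronx.trace(model, example)

# torch_neuronx.trace returns a TorchScript module, so it can be saved and
# reloaded for inference on the Neuron instance.
torch.jit.save(model_neuron, "my_retinanet_neuron.pt")
loaded = torch.jit.load("my_retinanet_neuron.pt")
output = loaded(example)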

mrnikwaws commented 3 months ago

@sallamander317 Checking back: were you able to test on regular PyTorch / CPU, i.e. just run the model against the tensor without moving it off the CPU or compiling?

import torch
from my_model_module import MyRetinaNet

fpath_to_my_model = "fpath_to_model"
device = torch.device("cpu")  # keep everything on CPU, no Neuron involved
model = MyRetinaNet(fpath_to_my_model, device)
model.eval()
input_tensor = torch.randn(1, 3, 1280, 896)
model(input_tensor)

We'll plan to close this issue if we don't hear back in the next week.

sallamander317 commented 3 months ago

@mrnikwaws I have not been able to test yet, and we're currently holding off on this until we find more time. Feel free to close, and when we get around to testing I can circle back. Appreciate you checking in!

aws-taylor commented 1 month ago

Thanks @sallamander317,

Resolving for now. If you need further support don't hesitate to reach out to us again.