Closed sallamander317 closed 1 month ago
Hi @sallamander317, this seems to be a model issue, not a Neuron issue. Have you successfully compiled and executed the model on a CPU (without involving Neuron)? Also, what version of the Neuron SDK are you using?
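For the version question, a minimal sketch like the one below can collect the relevant package versions from the active Python environment. The helper name `installed_versions` and the `"neuron"` / `"torch"` name filters are assumptions made for illustration, not part of the Neuron SDK.

```python
from importlib import metadata


def installed_versions(substrings=("neuron", "torch")):
    """Return {package name: version} for installed distributions whose
    name contains any of the given substrings (case-insensitive)."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing metadata; match on lowercase name.
        if name and any(s in name.lower() for s in substrings):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    # Print in a pip-freeze-like format that is easy to paste into an issue.
    for name, version in sorted(installed_versions().items()):
        print(f"{name}=={version}")
```

This only sees packages installed in the current interpreter's environment, so it should be run from the same virtual environment used for compilation.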
Hi @aws-donkrets, can you clarify what you mean by compiled / executed on the CPU?
I think this will answer your question, but I just want to make sure that "compile" doesn't refer to any kind of ONNX and / or torch.compile call. We've successfully trained this model on a GPU, and have also successfully run inference on a CPU and GPU before (but on an AWS device with neuron cores). I guess this raises the question: is it a requirement that we train our model on a neuron core in order to run inference on a neuron core? That wasn't clear to me in the documentation, but I very well could have missed it.
Here are all the versions of the neuron packages we're using:
torch.compile
My intent was to see whether you had successfully trained and run inference on the model independent of Neuron, and your answer says that you had. By the way, it is NOT a requirement to train a model on Neuron in order to run inference on Neuron. In fact, many users train on a GPU and then deploy their model on Neuron to take advantage of the sometimes better price/performance Neuron offers, so you didn't miss anything in the documentation.
@sallamander317 checking back: were you able to test on regular PyTorch on CPU, i.e. just run the model against the tensor without moving it off the CPU or compiling?
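import torch
device = torch.device("cpu")  # assumption: keep everything on CPU, no Neuron involved
model = MyRetinaNet(fpath_to_my_model, device)  # MyRetinaNet and fpath_to_my_model come from your own code
model.eval()
input_tensor = torch.randn(1, 3, 1280, 896)
with torch.no_grad():  # inference only, no gradients needed
    model(input_tensor)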
We'll plan to close this issue if we don't hear back in the next week.
@mrnikwaws I have not been able to test and we're currently holding on this until we find more time. Feel free to close and when we get around to testing I can circle back. Appreciate you checking in!
Thanks @sallamander317,
Resolving for now. If you need further support don't hesitate to reach out to us again.
I'm getting the following error when trying to compile a model (code snippet below). It appears that the file it points to is only accessible while the model is actively compiling, and throwing an import pdb; pdb.set_trace() in the Python code does not stop execution flow such that I can inspect that file. Even if I could, it's not clear to me where to go from there - it appears that this error is due to something in the underlying model defined in pytorch, but the error itself does not point to where that issue is.
Code used to try to compile (with private model name removed):
I'm on Python 3.10.6... all torch libraries in the environment:
I'm on an Amazon Linux 2 instance.
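Since the file mentioned in the error seems to exist only while compilation is running, one generic workaround (not a Neuron feature) might be to copy the compiler's working directory from a background thread while the compile call executes. The sketch below uses only the standard library; `snapshot_while_running`, `src_dir`, and `dst_dir` are hypothetical names, and it assumes you know which directory the transient file is written to.

```python
import shutil
import threading
import time
from pathlib import Path


def snapshot_while_running(work, src_dir, dst_dir, interval=0.05):
    """Run work() while a background thread mirrors src_dir into dst_dir.

    Files are copied each time the thread polls, so files that are deleted
    when work() finishes can still be inspected in dst_dir afterwards.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    stop = threading.Event()

    def mirror():
        while True:
            finished = stop.is_set()
            try:
                if src.is_dir():
                    for path in src.rglob("*"):
                        if path.is_file():
                            target = dst / path.relative_to(src)
                            target.parent.mkdir(parents=True, exist_ok=True)
                            shutil.copy2(path, target)
            except OSError:
                pass  # a file or directory vanished mid-copy; retry next poll
            if finished:
                break
            time.sleep(interval)

    thread = threading.Thread(target=mirror, daemon=True)
    thread.start()
    try:
        return work()
    finally:
        stop.set()
        thread.join()
```

Usage would look like `snapshot_while_running(lambda: compile_my_model(), "/path/from/error", "./compile_snapshot")`, where `compile_my_model` stands in for whatever call triggers compilation. Polling can miss files that exist for less than one interval, so this is a best-effort aid rather than a guarantee.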