aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Internal Compiler error when compiling a model #863

Closed alexandrekm closed 2 months ago

alexandrekm commented 2 months ago

I'm encountering difficulty compiling a particular model for Neuron using the torch_neuron package on Inf1 instances. While torch.jit.trace works as expected, Neuron compilation (torch.neuron.trace) fails. I've successfully compiled other models with Neuron in the past, so the issue seems specific to this one.

While I probably can't share the model due to its proprietary nature (it is similar to the other models I've compiled successfully), the analyze-model function returns positive results, suggesting the model structure itself might not be the root cause.

INFO:Neuron:99.47% of all operations (including primitives) (3552 of 3571) are supported
INFO:Neuron:96.16% of arithmetic operations (451 of 469) are supported
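For context, these percentages come from torch-neuron's model analysis step; a minimal sketch of how one obtains them (torch.neuron.analyze_model is the torch-neuron API, while the support_ratio helper is hypothetical, added only to mirror the log arithmetic):

```python
# Hypothetical helper mirroring the percentages in the INFO lines above.
def support_ratio(supported, total):
    """Percentage of operations Neuron reports as supported."""
    return round(100.0 * supported / total, 2)

def analyze(model, example_inputs):
    # Requires the torch-neuron package (Inf1 toolchain); imported lazily
    # so the helper above works without it.
    import torch.neuron
    # Emits INFO lines like the ones quoted above.
    return torch.neuron.analyze_model(model, example_inputs)
```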

This is the output of the compiler (I have the full DEBUG log in case it's necessary):

[neuron-cc]: ***************************************************************
[neuron-cc]:  An Internal Compiler Error has occurred
[neuron-cc]: ***************************************************************
[neuron-cc]:
[neuron-cc]: Error message:  invalid literal for int() with base 10: '0.01'
[neuron-cc]: Error class:    ValueError
[neuron-cc]: Error location: Unknown
[neuron-cc]: Version information:
[neuron-cc]:   Neuron Compiler version 1.22.0.0+d4b4f5311
[neuron-cc]:
[neuron-cc]:   HWM version 1.17.0.0-fbcd6c853
[neuron-cc]:   NEFF version Dynamic
[neuron-cc]:   TVM version 1.19.0.0+0
[neuron-cc]:   NumPy version 1.22.2
[neuron-cc]:   MXNet not available
[neuron-cc]:   TF not available

Output of pip list for the relevant packages:

tensorflow                   1.15.5.post1
neuron-cc                    1.22.0.0+d4b4f5311
torch                        1.13.1
torch-neuron                 1.13.1.2.9.74.0
torchaudio                   0.13.1
torchvision                  0.14.1

Other system details (I am using an EC2 instance to compile it)

EC2 Instance:  m5a.2xlarge
Memory: 32GB
OS: Ubuntu 22.04

Please let me know if more information is needed or what other tests I need to do

aws-donkrets commented 2 months ago

Hi alexandrekm - Since you can't share the actual model, I'm responding based on the error message you provided. The message states: invalid literal for int() with base 10: '0.01'

Is the model attempting to assign a floating point value (0.01) to an integer typed variable?
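For reference, that exception class and message are exactly what Python itself raises when a decimal string reaches int() directly; a float round-trip is the usual fix (a minimal sketch, not the compiler's actual code path):

```python
# Reproduce the exact exception the compiler surfaced: int() rejects a
# decimal string outright, even though it "looks numeric".
try:
    int("0.01")
except ValueError as exc:
    message = str(exc)  # invalid literal for int() with base 10: '0.01'

# Going through float() first is the usual remedy when a fractional
# string must become an integer (truncates toward zero).
as_int = int(float("0.01"))
```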

alexandrekm commented 2 months ago

Hi @aws-donkrets Thanks for taking a look at this.

The model seems to convert from PyTorch to the SavedModel format successfully, but the subsequent neuron-cc compilation step fails.

Troubleshooting Steps:

  1. Cast Analysis: I haven't observed any explicit string-to-integer conversion within the model itself; the '0.01' value actually lives in a string. The cast may come from one of the frameworks we use, but since I don't have access to a debugger within neuron-cc (where this fails), I can't be sure. Is this something I can do myself?
  2. Compilation Breakdown: The compilation process appears to be two-fold (is this correct?): Stage 1 converts the PyTorch model to a SavedModel (presumably using torch.jit.trace, which succeeds on its own). Stage 2 compiles the SavedModel using neuron-cc.
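As a sketch, the two stages can be separated like this (stage 2 assumes the torch-neuron package; the function names here are illustrative):

```python
import torch

def stage1_trace(model, example):
    # Stage 1: plain TorchScript trace -- the step that succeeds on its own.
    return torch.jit.trace(model, example)

def stage2_neuron(model, example):
    # Stage 2: Neuron trace, which lowers the traced graph and invokes
    # neuron-cc under the hood -- the step that fails with the ICE above.
    import torch.neuron  # lazy import so stage 1 runs without the Neuron SDK
    return torch.neuron.trace(model, example)
```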

Reproducing the Error:

The failure can be isolated and reproduced by running just the second stage (neuron-cc compilation) with the specific commands extracted from the logs. Here's an example of the recreated command:


```
neuron-cc compile \
  /home/ubuntu/code/neuron-cc-inf1/1/graph_def.pb \
  --framework TENSORFLOW \
  --pipeline compile SaveTemps \
  --output /home/ubuntu/code/neuron-cc-inf1/1/graph_def.neff \
  --io-config '{"inputs": {"tensor.1:0": [[1, 3, 448, 768], "float32"]}, "outputs": "... (list of outputs) ..."}' \
  --verbose 35
```
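If it helps anyone recreating this by hand, the --io-config value is ordinary JSON mapping input tensor names to [shape, dtype] pairs. A small builder sketch (the output name below is a placeholder, since the real output list was elided in the log, and make_io_config is a hypothetical helper):

```python
import json

def make_io_config(inputs, outputs):
    # Serialize the structure neuron-cc expects for --io-config.
    return json.dumps({"inputs": inputs, "outputs": outputs})

io_config = make_io_config(
    inputs={"tensor.1:0": [[1, 3, 448, 768], "float32"]},
    outputs=["output:0"],  # placeholder name; the real outputs were elided
)
```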
jluntamazon commented 2 months ago

> I haven’t observed any explicit string-to-integer conversion within the model itself. It’s actually a string that contains the ‘0.01’ value. This cast can come from one of the frameworks we use but since I do not have access to a debugger within neuron-cc (where this fails) I am not sure. Is this something that I can do myself?

I think the easiest thing you could try yourself is to come up with a minimal reproduction that does not contain any proprietary architectural information. The way you might approach this is to create a model with a single layer (instead of multiple) and then attempt to compile it like before. If this still causes an error, then remove submodules from the end of the layer until just a few operators can reproduce the failure. At this point you should be able to share a minimal set of operations to reproduce the issue.
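The trim-from-the-end search described above can also be done as a bisection over the failing prefix; a self-contained sketch, where the compiles predicate is a stand-in for "torch.neuron.trace succeeds on a model built from these ops" (you would supply your own):

```python
# Find the smallest prefix of `ops` that still reproduces the compile
# failure, assuming failures are caused by some op at a fixed position.
def minimal_failing_prefix(ops, compiles):
    assert not compiles(ops), "full op list must reproduce the failure"
    lo, hi = 1, len(ops)          # smallest failing prefix length is in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if compiles(ops[:mid]):   # prefix still compiles -> failure is later
            lo = mid + 1
        else:                     # prefix already fails -> shrink the window
            hi = mid
    return ops[:lo]
```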

> Compilation Breakdown: The compilation process appears to be two-fold (is this correct?): Stage 1: Converts the PyTorch model to a SavedModel (presumably using torch.jit.trace which succeeds on its own). Stage 2: Compiles the SavedModel using neuron-cc.

Yes, exactly correct. Because there are a few stages to compilation, the easiest thing to do is come up with a minimal reproduction so we can determine exactly which component is failing.

alexandrekm commented 2 months ago

I managed to figure out what the issue was, and disabling a part of the model solved it. Thanks for the help.