aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Unable to trace a ViT #807

Open z-at-drigmo opened 10 months ago

z-at-drigmo commented 10 months ago

Hi,

I'm trying to trace the vision encoder of Meta's Segment Anything Model (SAM). I'm hitting several errors during the trace, and the process now seems to be stuck.

The script no longer seems to be consuming CPU or memory; it just keeps printing "..." (it has been more than 12 hours now).

The following error appears many times in the logs:

INFO:Neuron:Compiling with command line: '/home/ubuntu/env/bin/neuron-cc compile /tmp/tmpror45kc0/model --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpror45kc0/graph_def.neff --verbose 35'
2023-12-29 11:39:47,031 - Neuron - INFO - Compiling with command line: '/home/ubuntu/env/bin/neuron-cc compile /tmp/tmpror45kc0/model --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpror45kc0/graph_def.neff --verbose 35'
.12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: ***************************************************************
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:  An Internal Compiler Error has occurred
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: ***************************************************************
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: 
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: Error message:  Traceback (most recent call last):
  [bt] (5) /home/ubuntu/env/lib/python3.8/site-packages/tvm/libtvm.so(TVMFuncCall+0x52) [0x7effcf58c442]
  [bt] (4) /home/ubuntu/env/lib/python3.8/site-packages/tvm/libtvm.so(+0xb456f4) [0x7effcf30d6f4]
  [bt] (3) /home/ubuntu/env/lib/python3.8/site-packages/tvm/libtvm.so(+0xb44155) [0x7effcf30c155]
  [bt] (2) /home/ubuntu/env/lib/python3.8/site-packages/tvm/libtvm.so(+0xba3d3b) [0x7effcf36bd3b]
  [bt] (1) /home/ubuntu/env/lib/python3.8/site-packages/tvm/libtvm.so(+0xba1e40) [0x7effcf369e40]
  [bt] (0) /home/ubuntu/env/lib/python3.8/site-packages/tvm/libtvm.so(+0x5edf75) [0x7effcedb5f75]
  File "/opt/brazil-pkg-cache/packages/DmlcTvm/DmlcTvm-1.18.2.0/AL2_x86_64/generic-flavor/src/src/relay/pass/partition_graph.cc", line 850
TVMError: Check failed: found: Could not find primary output
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: 
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: Error class:    TVMError
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: Error location: Unknown
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: Command line:   /home/ubuntu/env/bin/neuron-cc compile /tmp/tmpror45kc0/model --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpror45kc0/graph_def.neff --verbose 35
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: 
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: Internal details:
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/CommandDriver.py", line 224, in neuroncc.driver.CommandDriver.CommandDriver.run
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/commands/CompileCommand.py", line 580, in neuroncc.driver.commands.CompileCommand.CompileCommand.run
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/commands/CompileCommand.py", line 558, in neuroncc.driver.commands.CompileCommand.CompileCommand.runPipeline
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/commands/CompileCommand.py", line 562, in neuroncc.driver.commands.CompileCommand.CompileCommand.runPipeline
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/Job.py", line 289, in neuroncc.driver.Job.SingleInputJob.run
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/Pipeline.py", line 30, in neuroncc.driver.Pipeline.Pipeline.runSingleInput
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/Job.py", line 289, in neuroncc.driver.Job.SingleInputJob.run
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/Pipeline.py", line 30, in neuroncc.driver.Pipeline.Pipeline.runSingleInput
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/Job.py", line 289, in neuroncc.driver.Job.SingleInputJob.run
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/jobs/Frontend.py", line 433, in neuroncc.driver.jobs.Frontend.Frontend.runSingleInput
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/jobs/Frontend.py", line 383, in neuroncc.driver.jobs.Frontend.Frontend.runTVMFrontend
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/jobs/Frontend.py", line 384, in neuroncc.driver.jobs.Frontend.Frontend.runTVMFrontend
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "neuroncc/driver/jobs/Frontend.py", line 388, in neuroncc.driver.jobs.Frontend.Frontend.runTVMFrontend
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "/home/ubuntu/env/lib/python3.8/site-packages/tvm/relay/build_module.py", line 762, in build_graph
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:     unit_level_func, func = ir_pass.color_graph(func, num_tpbs,
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "/home/ubuntu/env/lib/python3.8/site-packages/tvm/relay/ir_pass.py", line 1258, in color_graph
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:     funcs = _ir_pass.color_graph(expr, numTpbs_Int, sbufSize_Int,
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   File "/home/ubuntu/env/lib/python3.8/site-packages/tvm/_ffi/_ctypes/function.py", line 190, in __call__
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:     raise get_last_ffi_error()
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: 
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: Version information:
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   Neuron Compiler version 1.21.0.0+6cae3b7b8
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   HWM version 1.16.2.0-295b36a2b
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   NEFF version Dynamic
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   TVM version 1.18.2.0+0
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   NumPy version 1.22.2
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   MXNet not available
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]:   TF not available
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: 
12/29/2023 11:39:49 AM ERROR 9766 [neuron-cc]: Artifacts stored in: /tmp/tmpror45kc0

Compiler status ERROR
INFO:Neuron:Compile command returned: 1
2023-12-29 11:39:50,191 - Neuron - INFO - Compile command returned: 1
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$1149; falling back to native python function call
2023-12-29 11:39:50,191 - Neuron - WARNING - torch.neuron.trace failed on _NeuronGraph$1149; falling back to native python function call
ERROR:Neuron:neuron-cc failed with the following command line call:
/home/ubuntu/env/bin/neuron-cc compile /tmp/tmpror45kc0/model --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpror45kc0/graph_def.neff --verbose 35
Traceback (most recent call last):
  File "/home/ubuntu/env/lib/python3.8/site-packages/torch_neuron/convert.py", line 413, in op_converter
    neuron_function = self.subgraph_compiler(
  File "/home/ubuntu/env/lib/python3.8/site-packages/torch_neuron/decorators.py", line 263, in trace
    raise subprocess.SubprocessError(
subprocess.SubprocessError: neuron-cc failed with the following command line call:
/home/ubuntu/env/bin/neuron-cc compile /tmp/tmpror45kc0/model --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpror45kc0/graph_def.neff --verbose 35

aws-taylor commented 10 months ago

Hello @z-at-drigmo,

Can you share a minimal reproduction of this error? We have seen a similar error before, and that issue ultimately came down to duplicate output names. Without more information, though, it is hard to say whether the same thing is happening here.

-Taylor
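
As a rough illustration of what a duplicate-name check could look like, here is a minimal sketch. It assumes the output names of a subgraph have already been collected into a plain list; how to extract them from the compiler artifacts is not covered here. The helper name and the sample names are hypothetical and not part of the Neuron tooling.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    # dict.fromkeys keeps the first occurrence of each name and preserves order.
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


# Hypothetical output names, for illustration only.
output_names = ["encoder/out:0", "encoder/attn:0", "encoder/out:0"]
print(find_duplicate_names(output_names))
```

An empty result would suggest that duplicate output names are not the cause in a given subgraph.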

z-at-drigmo commented 10 months ago

Hi Taylor,

Yes, I can reproduce it when tracing the image encoder of Meta's Segment Anything Model:

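import torch
import torch_neuron  # registers the torch.neuron namespace

from transformers import SamModel

# Load the full SAM checkpoint; falls back to CPU when CUDA is unavailable.
device = "cuda" if torch.cuda.is_available() else "cpu"
sam = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)

# Only the vision (image) encoder is traced.
model = sam.vision_encoder
model.eval()

# A single 1024x1024 RGB image as example input.
dummy_inputs = {
    "pixel_values": torch.randn(1, 3, 1024, 1024, dtype=torch.float)
}

traced_model = torch.neuron.trace(model, tuple(dummy_inputs.values()), separate_weights=True)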

Let me know if you need any other information.
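
For anyone triaging similar output, one rough way to see how many subgraphs fall back to CPU and which TVM errors recur is to tally a saved copy of the trace log. This is a minimal sketch, not part of the Neuron SDK. The function name and the sample text are illustrative, and the patterns assume the log format quoted earlier in this thread.

```python
import re
from collections import Counter

# Patterns based on the log lines quoted in this thread.
# Matching the "WARNING:Neuron:" prefix avoids double-counting the
# timestamped copy of the same message emitted by the second log handler.
FAILED_SUBGRAPH = re.compile(r"WARNING:Neuron:torch\.neuron\.trace failed on (\S+);")
TVM_ERROR = re.compile(r"^TVMError: (.+)$")


def summarize_trace_log(text):
    """Return (list of failed subgraph names, Counter of TVMError messages)."""
    failed = []
    errors = Counter()
    for line in text.splitlines():
        match = FAILED_SUBGRAPH.search(line)
        if match:
            failed.append(match.group(1))
        match = TVM_ERROR.match(line.strip())
        if match:
            errors[match.group(1)] += 1
    return failed, errors


# Sample lines copied from the log above.
sample = """\
TVMError: Check failed: found: Could not find primary output
WARNING:Neuron:torch.neuron.trace failed on _NeuronGraph$1149; falling back to native python function call
2023-12-29 11:39:50,191 - Neuron - WARNING - torch.neuron.trace failed on _NeuronGraph$1149; falling back to native python function call
"""

failed, errors = summarize_trace_log(sample)
print(failed)
print(errors.most_common())
```

To run it on a real log, read the file contents into a string and pass that to `summarize_trace_log`; the file location is up to whoever saved the output.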