aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

facebook/detr-resnet-50 Compile Error #844

Open feknall opened 8 months ago

feknall commented 8 months ago

I'm getting compile errors for the detr-resnet-50 model on AWS Inf2.

Neuron Packages:

aws-neuronx-collectives.x86_64         2.20.11.0_c101c322e-1              @neuron     
aws-neuronx-dkms.noarch                2.15.9.0-dkms                      @neuron     
aws-neuronx-runtime-lib.x86_64         2.20.11.0_b7d33e68b-1              @neuron     
aws-neuronx-tools.x86_64               2.17.0.0-1                         @neuron  

Pip Packages:

torch                         2.1.2
torch-neuronx                 2.1.1.2.0.1b0
torch-xla                     2.1.1
torchvision                   0.16.2
transformers                  4.38.2
timm                          0.9.16

AMI inf2.xlarge:

al2023-ami-2023.3.20240219.0-kernel-6.1-x86_64

Reproduce it by:

import torch
import torch_neuronx
from transformers import DetrForObjectDetection

x = torch.rand([1, 3, 640, 640], dtype=torch.float32)

model_config = 'facebook/detr-resnet-50'
model = DetrForObjectDetection.from_pretrained(model_config, torchscript=True)
model.eval()

with torch.inference_mode():
    neuron_model = torch_neuronx.trace(model, x)

print(neuron_model(x))

Logs:

...
2024-03-01T20:56:44Z INFO 34080 [LateTongaInstComb]: Finished (changed=True)
2024-03-01T20:56:44Z USER 34080 [sg0000/Tensorizer/LateTongaInstComb]: LateTongaInstComb finished after 0.432 seconds
2024-03-01T20:56:44Z USER 34080 [sg0000/Tensorizer/SplitAccGrp]: Running SplitAccGrp
2024-03-01T20:56:44Z INFO 34080 [SplitAccGrp]: Finished (changed=False)
2024-03-01T20:56:44Z USER 34080 [sg0000/Tensorizer/SplitAccGrp]: SplitAccGrp finished after 0.021 seconds
2024-03-01T20:56:44Z USER 34080 [sg0000/Tensorizer/SpillPSum]: Running SpillPSum
2024-03-01T20:56:45Z INFO 34080 [SpillPSum]: Finished (changed=True)
2024-03-01T20:56:45Z USER 34080 [sg0000/Tensorizer/SpillPSum]: SpillPSum finished after 0.276 seconds
2024-03-01T20:56:45Z USER 34080 [sg0000/Tensorizer/LowerIntrinsics]: Running LowerIntrinsics
2024-03-01T20:56:45Z USER 34080 [sg0000/Tensorizer/LowerIntrinsics]: LowerIntrinsics finished after 0.004 seconds
2024-03-01T20:56:45Z ERROR 34080 [Tensorizer]: Transformation error on operator: _custom-call.4006
2024-03-01T20:56:45Z ERROR 34080 [NeuronAssert]: Assertion failure in usr/lib64/python3.9/multiprocessing/process.py at line 108
2024-03-01T20:56:45Z INFO 34080 [root/Tensorizer/All]: Exit time region: delta=40.920s
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]: ***************************************************************
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:  An Internal Compiler Error has occurred
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]: ***************************************************************
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]: 
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]: [TEN404] (_custom-call.4006) Internal tensorizer error - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]: 
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]: Internal details:
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]: Type: <class 'neuronxcc.logging.Assert.NeuronAssertionError'>
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/CommandDriver.py", line 343, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1184, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1143, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1160, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1163, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/Job.py", line 344, in neuronxcc.driver.Job.SingleInputJob.run
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/Job.py", line 370, in neuronxcc.driver.Job.SingleInputJob.runOnState
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/Job.py", line 344, in neuronxcc.driver.Job.SingleInputJob.run
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/Job.py", line 370, in neuronxcc.driver.Job.SingleInputJob.runOnState
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/jobs/Frontend.py", line 405, in neuronxcc.driver.jobs.Frontend.Frontend.runSingleInput
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/jobs/Frontend.py", line 205, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/driver/jobs/Frontend.py", line 210, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/Penguin.py", line 329, in neuronxcc.starfish.penguin.Penguin.runPenguin
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/Frontend.py", line 138, in neuronxcc.starfish.penguin.Frontend.tensorizeXla
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/Frontend.py", line 139, in neuronxcc.starfish.penguin.Frontend.tensorizeXla
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/Frontend.py", line 147, in neuronxcc.starfish.penguin.Frontend.tensorizeXla
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/Frontend.py", line 253, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaFromFile
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/Compile.py", line 236, in neuronxcc.starfish.penguin.Compile.compile_module
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/Compile.py", line 239, in neuronxcc.starfish.penguin.Compile.compile_module
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/Compile.py", line 290, in neuronxcc.starfish.penguin.Compile.compile_module
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 499, in neuronxcc.starfish.penguin.DotTransform.PassManager.transformModule
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 520, in neuronxcc.starfish.penguin.DotTransform.PassManager.transformFunction
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 531, in neuronxcc.starfish.penguin.DotTransform.PassManager.transformFunction
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 162, in neuronxcc.starfish.penguin.DotTransform.DotTransform.runOnFunction
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 240, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 255, in neuronxcc.starfish.penguin.DotTransform.DotTransform.rethrow_exception
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/logging/Assert.py", line 75, in neuronxcc.logging.Assert.neuron_assert
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]: Cause:
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 228, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 268, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 270, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 271, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 300, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 302, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 398, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformFunction
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 399, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformFunction
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 390, in neuronxcc.starfish.penguin.DotTransform.DotTransform.runTransforms
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 379, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmts
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 153, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 431, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 434, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 153, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/targets/transforms/LowerIntrinsics.py", line 1311, in neuronxcc.starfish.penguin.targets.transforms.LowerIntrinsics.LowerIntrinsics.transformInternalNativeBaremetalKernel
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/targets/transforms/LowerIntrinsics.py", line 38, in neuronxcc.starfish.penguin.targets.transforms.LowerIntrinsics.inline
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/targets/transforms/KernelBuilder.py", line 2347, in neuronxcc.starfish.penguin.targets.transforms.KernelBuilder.TraceKernel.inline
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/starfish/penguin/targets/transforms/KernelBuilder.py", line 2348, in neuronxcc.starfish.penguin.targets.transforms.KernelBuilder.TraceKernel.inline
2024-03-01T20:56:45Z ERROR 34080 [neuronxcc.driver.CommandDriver]:   File "neuronxcc/thor/kernels.py", line 1441, in neuronxcc.thor.kernels.resize_nearest_fixed_dma_kernel
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]: 
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]: Diagnostic information:
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]:   NeuronX Compiler version 2.12.68.0+4480452af
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]:   
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]:   Python version 3.9.16
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]:   HWM version 2.12.0.0-422c9037c
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]:   NumPy version 1.25.2
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]:   
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]:   Running on AMI ami-0440d3b780d96b29d
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]:   Running in region use1-az6
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]: 
2024-03-01T20:56:45Z USER 34080 [neuronxcc.driver.CommandDriver]: Diagnostic logs stored in /home/ec2-user/log-neuron-cc.txt
2024-03-01T20:56:45Z INFO 34080 [neuronxcc.driver.CommandDriver]: Artifacts stored in: /home/ec2-user/neuronxcc-rksxc8qh
2024-03-01T20:56:45Z INFO 34073 [root]: Subcommand returned with exitcode=70

jeffhataws commented 8 months ago

Thanks @feknall , we have reproduced the issue and will take a look.

feknall commented 5 months ago

@jeffhataws Hey. Is there any update or an estimate for the fix?