aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

torch.linalg.inv crashes the torch_neuronx.trace #816

Open priyamharsh14 opened 10 months ago

priyamharsh14 commented 10 months ago

My model calls torch.linalg.inv in the middle of other calculations, and compiling it with torch_neuronx.trace fails. I added breakpoints to find where compilation breaks and narrowed it down to the torch.linalg.inv call, which appears to trigger a memory-related error in the compiler.

import torch
import torch_neuronx

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        return torch.linalg.inv(x)

X = torch.eye(4)
model = Model()
model.eval()
neuron_model = torch_neuronx.trace(model, X)
torch.jit.save(neuron_model, "inv.pt")

The script above is a minimal dummy model I wrote based on those findings. Running it crashes in the same way. Here are the logs:

2024-01-15 08:39:50.000723:  35873  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-15 08:39:50.000724:  35873  INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_122691202107590206+d41d8cd9/model.neff. Exiting with a successfully compiled graph.
2024-01-15T08:39:51Z Compilation is optimized for best performance and compilation time. For faster compilation time please use -O1
Process Process-1:
Traceback (most recent call last):
  File "neuronxcc/driver/CommandDriver.py", line 343, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand
  File "neuronxcc/driver/commands/CompileCommand.py", line 1184, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
  File "neuronxcc/driver/commands/CompileCommand.py", line 1143, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
  File "neuronxcc/driver/commands/CompileCommand.py", line 1160, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
  File "neuronxcc/driver/commands/CompileCommand.py", line 1163, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
  File "neuronxcc/driver/Job.py", line 344, in neuronxcc.driver.Job.SingleInputJob.run
  File "neuronxcc/driver/Job.py", line 370, in neuronxcc.driver.Job.SingleInputJob.runOnState
  File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
  File "neuronxcc/driver/Job.py", line 344, in neuronxcc.driver.Job.SingleInputJob.run
  File "neuronxcc/driver/Job.py", line 370, in neuronxcc.driver.Job.SingleInputJob.runOnState
  File "neuronxcc/driver/jobs/Frontend.py", line 405, in neuronxcc.driver.jobs.Frontend.Frontend.runSingleInput
  File "neuronxcc/driver/jobs/Frontend.py", line 203, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
  File "neuronxcc/driver/jobs/Frontend.py", line 186, in neuronxcc.driver.jobs.Frontend.Frontend.runHlo2Tensorizer
neuronxcc.driver.Exceptions.CompilerInvalidInputException: ERROR: Failed command  /home/ubuntu/venv/lib/python3.10/site-packages/neuronxcc/starfish/bin/hlo2penguin --input /tmp/tmpy_79ntzg/model --out-dir ./ --output penguin.py --layers-per-module=1 --coalesce-all-gathers=false --coalesce-reduce-scatters=false --coalesce-all-reduces=false --emit-tensor-level-dropout-ops --emit-tensor-level-rng-ops
------------ 
Reported stdout: 
INFO: Found memory bound graph
Replaced 0 dropout sequences with OffloadedDropout
INFO: HloMacCount has found 0
INFO: Traffic has found 136
INFO: AIF 0
HLO Ops used in computation: add broadcast compare constant custom-call get-tuple-element iota parameter select transpose triangular-solve tuple 
%14 = "mhlo.triangular_solve"(%12, %13) {left_side = true, lower = false, transpose_a = #mhlo<transpose NO_TRANSPOSE>, unit_diagonal = false} : (tensor<4x4xf32>, tensor<4x4xf32>) -> tensor<4x4xf32>
double free or corruption (out)

------------ 
Reported stderr: 
None
------------                       
Import of the HLO graph into the Neuron Compiler has failed.                       
This may be caused by unsupported operators or an internal compiler error.                       
More details can be found in the error message(s) above.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "neuronxcc/driver/CommandDriver.py", line 350, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand_in_process
  File "neuronxcc/driver/CommandDriver.py", line 345, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand
  File "neuronxcc/driver/CommandDriver.py", line 111, in neuronxcc.driver.CommandDriver.handleError
  File "neuronxcc/driver/GlobalState.py", line 102, in neuronxcc.driver.GlobalState.FinalizeGlobalState
  File "neuronxcc/driver/GlobalState.py", line 82, in neuronxcc.driver.GlobalState._GlobalStateImpl.shutdown
  File "/usr/lib/python3.10/shutil.py", line 715, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 713, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/neuronxcc-0u8ihkgq'
Traceback (most recent call last):
  File "/home/ubuntu/test.py", line 16, in <module>
    neuron_model = torch_neuronx.trace(model, X)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 386, in trace
    neff_filename, metaneff, flattener, packer = _trace(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 479, in _trace
    neff_filename = hlo_compile(model_dir, compiler_workdir, compiler_args)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 283, in hlo_compile
    raise RuntimeError(f"neuronx-cc failed with {status}")
RuntimeError: neuronx-cc failed with 1

jluntamazon commented 10 months ago

Hello @priyamharsh14,

Currently we don't support the torch.linalg.inv operation. We will add support in a future release.

priyamharsh14 commented 10 months ago

@jluntamazon thanks for your reply. Can you please provide any workaround for this?
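
For readers looking for an interim option: one possible approach, not confirmed by the Neuron team and not tested with torch_neuronx.trace, is to avoid torch.linalg.inv entirely and approximate the inverse with a Newton-Schulz iteration. That iteration uses only matrix multiplies, additions, and simple reductions, so it does not emit the triangular-solve op that appears in the failing HLO above. The module name IterativeInverse and the num_iters default below are assumptions made for illustration. The sketch handles a single square 2-D matrix, and whether the Neuron compiler accepts the resulting graph is also an assumption.

```python
import torch


class IterativeInverse(torch.nn.Module):
    """Hypothetical stand-in for torch.linalg.inv on one square 2-D matrix.

    Uses the Newton-Schulz iteration X_{k+1} = X_k (2I - A X_k), which needs
    only matmul, add, abs, sum, and max. Not an official Neuron workaround.
    """

    def __init__(self, num_iters: int = 30):
        super().__init__()
        # Fixed iteration count so the loop unrolls to a static graph when traced.
        self.num_iters = num_iters

    def forward(self, a):
        n = a.shape[-1]
        eye = torch.eye(n, dtype=a.dtype, device=a.device)
        # Starting guess X0 = A^T / (||A||_1 * ||A||_inf); this scaling makes
        # the iteration converge for any nonsingular A.
        norm_1 = a.abs().sum(dim=-2).max()    # largest absolute column sum
        norm_inf = a.abs().sum(dim=-1).max()  # largest absolute row sum
        x = a.transpose(-1, -2) / (norm_1 * norm_inf)
        for _ in range(self.num_iters):
            x = x @ (2 * eye - a @ x)
        return x


if __name__ == "__main__":
    # CPU-only sanity check against the reference implementation.
    a = torch.tensor([[4.0, 1.0], [2.0, 3.0]])
    model = IterativeInverse()
    model.eval()
    print(torch.allclose(model(a), torch.linalg.inv(a), atol=1e-4))
```

Convergence is fast for well-conditioned matrices but can need many more iterations, or fail to reach float32 accuracy, when the matrix is close to singular, so the result should be checked against torch.linalg.inv on CPU for representative inputs before relying on it.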