I have a model that uses the torch.linalg.inv function in the middle of other calculations. When I tried to compile the model, it failed. I added some breakpoints to narrow down where the compilation was failing and determined that the torch.linalg.inv call is what triggers the memory-related error.
import torch
import torch_neuronx

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        return torch.linalg.inv(x)

X = torch.eye(4)
model = Model()
model.eval()
neuron_model = torch_neuronx.trace(model, X)
torch.jit.save(neuron_model, "inv.pt")
Above is a minimal dummy model I created based on my findings. When I ran this script, it crashed again. Here are the logs:
2024-01-15 08:39:50.000723: 35873 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-01-15 08:39:50.000724: 35873 INFO ||NEURON_CC_WRAPPER||: Using a cached neff at /var/tmp/neuron-compile-cache/neuronxcc-2.12.54.0+f631c2365/MODULE_122691202107590206+d41d8cd9/model.neff. Exiting with a successfully compiled graph.
2024-01-15T08:39:51Z Compilation is optimized for best performance and compilation time. For faster compilation time please use -O1
Process Process-1:
Traceback (most recent call last):
File "neuronxcc/driver/CommandDriver.py", line 343, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand
File "neuronxcc/driver/commands/CompileCommand.py", line 1184, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
File "neuronxcc/driver/commands/CompileCommand.py", line 1143, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
File "neuronxcc/driver/commands/CompileCommand.py", line 1160, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
File "neuronxcc/driver/commands/CompileCommand.py", line 1163, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
File "neuronxcc/driver/Job.py", line 344, in neuronxcc.driver.Job.SingleInputJob.run
File "neuronxcc/driver/Job.py", line 370, in neuronxcc.driver.Job.SingleInputJob.runOnState
File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
File "neuronxcc/driver/Job.py", line 344, in neuronxcc.driver.Job.SingleInputJob.run
File "neuronxcc/driver/Job.py", line 370, in neuronxcc.driver.Job.SingleInputJob.runOnState
File "neuronxcc/driver/jobs/Frontend.py", line 405, in neuronxcc.driver.jobs.Frontend.Frontend.runSingleInput
File "neuronxcc/driver/jobs/Frontend.py", line 203, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
File "neuronxcc/driver/jobs/Frontend.py", line 186, in neuronxcc.driver.jobs.Frontend.Frontend.runHlo2Tensorizer
neuronxcc.driver.Exceptions.CompilerInvalidInputException: ERROR: Failed command /home/ubuntu/venv/lib/python3.10/site-packages/neuronxcc/starfish/bin/hlo2penguin --input /tmp/tmpy_79ntzg/model --out-dir ./ --output penguin.py --layers-per-module=1 --coalesce-all-gathers=false --coalesce-reduce-scatters=false --coalesce-all-reduces=false --emit-tensor-level-dropout-ops --emit-tensor-level-rng-ops
------------
Reported stdout:
INFO: Found memory bound graph
Replaced 0 dropout sequences with OffloadedDropout
INFO: HloMacCount has found 0
INFO: Traffic has found 136
INFO: AIF 0
HLO Ops used in computation: add broadcast compare constant custom-call get-tuple-element iota parameter select transpose triangular-solve tuple
%14 = "mhlo.triangular_solve"(%12, %13) {left_side = true, lower = false, transpose_a = #mhlo<transpose NO_TRANSPOSE>, unit_diagonal = false} : (tensor<4x4xf32>, tensor<4x4xf32>) -> tensor<4x4xf32>
double free or corruption (out)
------------
Reported stderr:
None
------------
Import of the HLO graph into the Neuron Compiler has failed.
This may be caused by unsupported operators or an internal compiler error.
More details can be found in the error message(s) above.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "neuronxcc/driver/CommandDriver.py", line 350, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand_in_process
File "neuronxcc/driver/CommandDriver.py", line 345, in neuronxcc.driver.CommandDriver.CommandDriver.run_subcommand
File "neuronxcc/driver/CommandDriver.py", line 111, in neuronxcc.driver.CommandDriver.handleError
File "neuronxcc/driver/GlobalState.py", line 102, in neuronxcc.driver.GlobalState.FinalizeGlobalState
File "neuronxcc/driver/GlobalState.py", line 82, in neuronxcc.driver.GlobalState._GlobalStateImpl.shutdown
File "/usr/lib/python3.10/shutil.py", line 715, in rmtree
onerror(os.lstat, path, sys.exc_info())
File "/usr/lib/python3.10/shutil.py", line 713, in rmtree
orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/neuronxcc-0u8ihkgq'
Traceback (most recent call last):
File "/home/ubuntu/test.py", line 16, in <module>
neuron_model = torch_neuronx.trace(model, X)
File "/home/ubuntu/venv/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 386, in trace
neff_filename, metaneff, flattener, packer = _trace(
File "/home/ubuntu/venv/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 479, in _trace
neff_filename = hlo_compile(model_dir, compiler_workdir, compiler_args)
File "/home/ubuntu/venv/lib/python3.10/site-packages/torch_neuronx/xla_impl/trace.py", line 283, in hlo_compile
raise RuntimeError(f"neuronx-cc failed with {status}")
RuntimeError: neuronx-cc failed with 1
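In case it helps clarify what I'm after: if torch.linalg.inv simply isn't supported by the Neuron compiler, the workaround I'm considering is to compute the inverse on the host CPU and only trace the rest of the model. Below is a rough, unverified sketch of that idea; RestOfModel and its body are just placeholders standing in for the Neuron-friendly part of my real model.

import torch
import torch_neuronx

class RestOfModel(torch.nn.Module):
    # Placeholder for the portion of my real model that runs after the
    # inversion; it only uses ops that compile fine on Neuron.
    def forward(self, x_inv):
        return x_inv @ x_inv  # stand-in computation

rest = RestOfModel()
rest.eval()
example = torch.eye(4)
neuron_rest = torch_neuronx.trace(rest, example)

# At inference time: invert on the host CPU with stock PyTorch, then feed
# the result into the Neuron-compiled graph.
x = torch.eye(4)
x_inv = torch.linalg.inv(x)
out = neuron_rest(x_inv)

Is there a supported way to run matrix inversion on the device itself, or is this kind of host fallback the expected approach for now?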