aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost-effective, natively integrated into PyTorch and TensorFlow, and integrated with your favorite AWS services.
https://aws.amazon.com/machine-learning/neuron/

Runtime Error when Using Allocation API #1009

Open · nandeeka opened this issue 2 weeks ago

nandeeka commented 2 weeks ago

I am trying to use the Allocation API to manually allocate tensors in my NKI kernel. Unfortunately, even with a simple kernel that exponentiates every element, I am seeing an error. I have confirmed that the kernel finishes successfully with nki.language.sbuf, but does not work with nki.isa.sbuf.allocate.
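
Concretely, the only change between the passing and failing runs is the buffer argument on the tile declarations (full kernel at the end of this comment):

  # works:
  in_tile = nl.ndarray((P1, nl.par_dim(nl.tile_size.pmax), sz_f), dtype=in_tensor.dtype,
              buffer=nl.sbuf)
  # fails with TEN404:
  in_tile = nl.ndarray((P1, nl.par_dim(nl.tile_size.pmax), sz_f), dtype=in_tensor.dtype,
              buffer=ni.sbuf.allocate(byte_addr=0, allocated_block_shape=(2,)))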

I am getting a runtime error:

2024-10-10 21:04:44.000897:  9694  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-10-10 21:04:44.000898:  9694  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/e3803c79-ed34-4e58-b916-5eb2100598ed/model.MODULE_14154296256060722173+a5381035.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/e3803c79-ed34-4e58-b916-5eb2100598ed/model.MODULE_14154296256060722173+a5381035.neff --disable-internal-io-dge --verbose=35
.root = neuronxcc/starfish/penguin/targets/transforms/InlineNKIKernels.py
root = neuronxcc/starfish/penguin/targets/transforms
root = neuronxcc/starfish/penguin/targets
root = neuronxcc/starfish/penguin
root = neuronxcc/starfish

2024-10-10 21:04:46.000246:  9694  ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/e3803c79-ed34-4e58-b916-5eb2100598ed/model.MODULE_14154296256060722173+a5381035.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/e3803c79-ed34-4e58-b916-5eb2100598ed/model.MODULE_14154296256060722173+a5381035.neff', '--disable-internal-io-dge', '--verbose=35']: 2024-10-10T21:04:46Z [TEN404] (_custom-call.1) Internal tensorizer error: InlineNKIKernels:max() arg is an empty sequence - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.

2024-10-10 21:04:46.000246:  9694  ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ubuntu/neuroncc_compile_workdir/e3803c79-ed34-4e58-b916-5eb2100598ed/model.MODULE_14154296256060722173+a5381035.hlo_module.pb after 0 retries.
Traceback (most recent call last):
  File "/home/ubuntu/nki-kernels/out/../src/learn_nki/allocate.py", line 65, in <module>
    print(f"output_nki={out_tensor}")
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/_tensor.py", line 934, in __format__
    return object.__format__(self, format_spec)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/_tensor.py", line 431, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/_tensor_str.py", line 664, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/opt/aws_neuronx_venv_pytorch_2_1/lib/python3.10/site-packages/torch/_tensor_str.py", line 430, in _str_intern
    self = self.to("cpu")
RuntimeError: Bad StatusOr access: INTERNAL: RunNeuronCCImpl: error condition error != 0: <class 'subprocess.CalledProcessError'>: Command '['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/e3803c79-ed34-4e58-b916-5eb2100598ed/model.MODULE_14154296256060722173+a5381035.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/e3803c79-ed34-4e58-b916-5eb2100598ed/model.MODULE_14154296256060722173+a5381035.neff', '--disable-internal-io-dge', '--verbose=35']' returned non-zero exit status 70.

Environment: I started with the Neuron 2.20 DLAMI and installed the Allocation API using the .deb and .whl files @aws-serina-tan sent me.

Full Kernel:

"""
Example kernel used to demonstrate Neuron Profile.
"""
import torch
import neuronxcc.nki.language as nl
import neuronxcc.nki.isa as ni
from torch_neuronx import nki_jit
import math

import os
os.environ["NEURON_FRAMEWORK_DEBUG"] = "1"
os.environ["NEURON_CC_FLAGS"]= " --disable-internal-io-dge "

@nki_jit
def tensor_exp_kernel_(in_tensor, out_tensor):
  """NKI kernel to compute elementwise exponential of an input tensor

  Args:
      in_tensor: an input tensor of ANY 2D shape (up to SBUF size)
      out_tensor: an output tensor of ANY 2D shape (up to SBUF size)
  """
  sz_p, sz_f = in_tensor.shape

  i_f = nl.arange(sz_f)[None, :]
  P1 = math.ceil(sz_p / nl.tile_size.pmax)

  in_tile = nl.ndarray((P1, nl.par_dim(nl.tile_size.pmax), sz_f), dtype=in_tensor.dtype,
              # buffer=nl.sbuf)
              buffer=ni.sbuf.allocate(byte_addr=0, allocated_block_shape=(2,)))

  out_tile = nl.ndarray((P1, nl.par_dim(nl.tile_size.pmax), sz_f), dtype=out_tensor.dtype,
              # buffer=nl.sbuf)
              buffer=ni.sbuf.allocate(byte_addr=2 * sz_f * 4, allocated_block_shape=(2,)))

  bias_tile = nl.ndarray((P1, nl.par_dim(nl.tile_size.pmax), 1), dtype=out_tensor.dtype,
              # buffer = nl.sbuf)
              buffer=ni.sbuf.allocate(byte_addr=2 * 2 * sz_f * 4, allocated_block_shape=(2,)))

  for p in nl.affine_range(P1):
    # Generate tensor indices for the input/output tensors
    # pad index to pmax, for simplicity
    i_p = p * nl.tile_size.pmax + nl.arange(nl.tile_size.pmax)[:, None]

    # Load input data from external memory to on-chip memory
    # only read up to sz_p
    in_tile[p] = nl.load(in_tensor[i_p, i_f])

    bias_tile[p] = ni.memset(shape=(nl.tile_size.pmax, 1), value=0, dtype=in_tensor.dtype)

    # perform the computation
    out_tile[p] = ni.activation(nl.exp, in_tile[p], bias=bias_tile[p])

    # store the results back to external memory
    # only write up to sz_p
    nl.store(out_tensor[i_p, i_f], value=out_tile[p])

if __name__ == "__main__":
  from torch_xla.core import xla_model as xm
  device = xm.xla_device()

  in_tensor = torch.rand((256, 512), dtype=torch.float32).to(device=device)
  out_tensor = torch.zeros((256, 512), dtype=torch.float32).to(device=device)

  tensor_exp_kernel_(in_tensor, out_tensor)
  print(f"output_nki={out_tensor}")
aws-qieqingy commented 2 weeks ago

Hi Nandeeka! Thanks for your issue. This is a known bug in the allocated kernel lowering procedure that has been fixed internally. The fix will be available in the next compiler release.

In the meantime, the bug is only triggered when the kernel does not use any PSUM tensor. You should be able to temporarily bypass the issue by declaring a dummy PSUM tensor inside the kernel.

nandeeka commented 2 weeks ago

Hi @aws-qieqingy, thanks for getting back to me. I may be declaring this dummy tensor incorrectly, but I am seeing the same error even after adding the following to my kernel:

dummy = nl.ndarray((nl.par_dim(nl.tile_size.pmax), 1), dtype=in_tensor.dtype, buffer=nl.psum)

Thanks!

aws-qieqingy commented 2 weeks ago

The PSUM tensor also needs to be allocated with nl.psum.allocate(). Please post the error log if you are still encountering an error after this.

nandeeka commented 2 weeks ago

I think the issue was that this buffer was never used, so it was never actually initialized. By using it (e.g., with an nki.isa.memset), I seem to have fixed the problem.
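
In case it helps anyone else, this is roughly what I added inside the kernel. Treat it as a sketch: I have not confirmed the exact nl.psum.allocate() signature, so I call it with no arguments here, as in the comment above.

  # Dummy PSUM tensor to work around the TEN404 lowering bug.
  # nl.psum.allocate() arguments omitted; I have not confirmed the exact signature.
  dummy = nl.ndarray((nl.par_dim(nl.tile_size.pmax), 1), dtype=in_tensor.dtype,
              buffer=nl.psum.allocate())
  # The tensor must actually be written to, or it is never materialized
  # and the error comes back:
  dummy[...] = ni.memset(shape=(nl.tile_size.pmax, 1), value=0, dtype=in_tensor.dtype)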

For future reference, how do I produce the error log? The issue with the instructions seems to have been deleted. Thanks!

aws-qieqingy commented 2 weeks ago

You can pass --logfile in NEURON_CC_FLAGS to write the compiler log to a file. Here is the detailed documentation on compiler CLI flags in the Neuron docs: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/compiler/neuronx-cc/api-reference-guide/neuron-compiler-cli-reference-guide.html
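
For example (the log path here is just illustrative):

os.environ["NEURON_CC_FLAGS"] = " --disable-internal-io-dge --logfile=/tmp/neuronx-cc.log "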