aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

[Trn1] Is one-hot operation not fully supported? #679

Closed ukus04 closed 9 months ago

ukus04 commented 1 year ago

I tried to train my model with torch-neuronx by adapting my existing code, but several errors occur at compile time.

One of them occurs when compiling torch's one-hot function.

When I run my original code, an UNIMPLEMENTED error occurs.

     98 def sparse_to_wide(self, input_tensor):
---> 99     return F.one_hot(input_tensor.long(), self.num_classes).float()

RuntimeError: UNIMPLEMENTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) UNIMPLEMENTED: 
The following HLO instructions are not supported by neuronx-cc:
================================================================================

%dynamic-slice = f32[2,128,768] dynamic-slice(f32[2,256,768] %concatenate.61, s64[] %constant.955, s64[] %constant.956, s64[] %constant.955), dynamic_slice_sizes={2,128,768}

So I tested one-hot in isolation with simple code. Here's the code that I ran:

import torch
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

for _ in range(5):
    t = (torch.rand(8, 128, 20, device="xla") * 20).long()
    F.one_hot(t, 40)
    xm.mark_step()

Here's the log

RuntimeError                              Traceback (most recent call last)
Cell In[8], line 7
      5 for _ in range(5):
      6     t = (torch.rand(8, 128, 20, device="xla") * 20).long()
----> 7     F.one_hot(t, 40)
      8     xm.mark_step()

RuntimeError: INTERNAL: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) INTERNAL: neuronx-cc compilation failed.
     [[{{node XRTExecute}}]]
     [[XRTExecute_G12]]
  (1) INTERNAL: neuronx-cc compilation failed.
     [[{{node XRTExecute}}]]
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  0 successful operations.
  0 derived errors ignored.
  Recent warning and error logs:
    OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
  OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.

If the tensor has one dimension, the one-hot operation compiles successfully.

But when I create a tensor with multiple dimensions, compilation fails with the log above.

Even when I flatten the tensor, it fails with the same kind of error.

The simple code produces no UNIMPLEMENTED errors, which differs from what I see when running my own code.

So I want to ask the following:

  1. Is the one-hot operation fully supported currently?
  2. If my code uses operators that torch-neuronx does not support, can I still compile it?
  3. Why does the same function produce different errors?

jyang-aws commented 1 year ago

Hi ukus04,

With your simple one-hot example, I can reproduce the error; we'll debug it further and keep you updated. You're correct, it's not related to the UNIMPLEMENTED error. Currently one-hot is not fully supported because of this error.
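Until F.one_hot compiles for multi-dimensional inputs, one possible stopgap is to build the one-hot tensor from a broadcast comparison against torch.arange. This is only a sketch: the helper name one_hot_by_compare is hypothetical, and whether this formulation compiles with neuronx-cc has not been verified in this thread.

```python
import torch


def one_hot_by_compare(indices: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Hypothetical helper: one-hot encoding without calling F.one_hot.

    Compares each index against the class ids 0..num_classes-1 and
    returns a float tensor with one extra trailing dimension.
    """
    # Class ids live on the same device as the input (e.g. "xla" or "cpu").
    classes = torch.arange(num_classes, device=indices.device)
    # Broadcast: indices[..., None] == classes -> bool tensor of shape [..., num_classes]
    return (indices.unsqueeze(-1) == classes).float()
```

On CPU this matches F.one_hot(t, num_classes).float() for in-range indices; unlike F.one_hot, an out-of-range index yields an all-zero row instead of raising an error.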

Regarding the dynamic-slice UNIMPLEMENTED error at the beginning, could you try adding

export XLA_IR_DEBUG=1
export XLA_HLO_DEBUG=1

before running your script? It should provide trace info showing where the dynamic-slice is introduced, which would help answer your third question.
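In a notebook, where exporting shell variables is awkward, the same two variables can be set from Python with os.environ. A minimal sketch follows; setting them before torch_xla is imported is a conservative assumption so that the library sees them when it starts up.

```python
import os

# Same settings as the two `export` lines above, applied from Python.
# Assumption: set these before importing torch_xla so they are picked up.
os.environ["XLA_IR_DEBUG"] = "1"
os.environ["XLA_HLO_DEBUG"] = "1"

# Import torch_xla only after the variables are set, for example:
# import torch_xla.core.xla_model as xm
```

With both variables set, HLO instructions named in compiler errors should carry a metadata field that identifies the originating op and source line.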

ukus04 commented 1 year ago

I set the environment variables with os.environ before the training code, and the error now includes an additional metadata field:

metadata={op_type="aten__roll" op_name="IPKernelApp[_instance]/AsyncIOMainLoop[io_loop]/_UnixSelectorEventLoop[asyncio_loop]/IPythonKernel_1/ZMQInteractiveShell_1/ZMQInteractiveShell_1/SOME_CLASS_1/SOME_MODEL[model]/aten__roll" source_file="~~~blabla.py~~~" source_line=558}

And I cannot find "aten::roll" in the operator support docs.

So is the error caused by unsupported operators? Should they be run on the CPU? (This relates to my second question.)

-> In inference, operators that are unsupported on Inferentia are run on the CPU. I assumed the same applies to Trainium, but maybe not.

jyang-aws commented 12 months ago

Hi @ukus04,

We fixed a few issues in the latest release. The "INTERNAL: neuronx-cc compilation failed." issue triggered by the simple code example should be resolved in the upcoming release 2.12; please stay tuned.

That's correct: on Inferentia, unsupported ops can run on the CPU. On Trainium, adding extra mark_steps before and after the unsupported ops and moving the input tensors to the CPU may help in general. This changes the graph partitioning and lets you offload a portion of the graph to the CPU (with some performance implications).
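The mark_step suggestion can be sketched as a small wrapper around torch.roll, the op reported in the metadata above. The helper name roll_on_cpu and its mark_step parameter are hypothetical; on a Trainium instance you would pass xm.mark_step, while the default no-op lets the function run on a plain CPU setup.

```python
import torch


def roll_on_cpu(x, shifts, dims, mark_step=lambda: None):
    """Hypothetical helper: run torch.roll on the CPU between two graph cuts.

    Pass mark_step=xm.mark_step (torch_xla.core.xla_model) when x lives on
    an XLA device; the default no-op keeps the sketch runnable without XLA.
    """
    mark_step()                 # cut the graph before the unsupported op
    rolled = torch.roll(x.cpu(), shifts, dims)  # execute the op on the CPU
    out = rolled.to(x.device)   # move the result back to the original device
    mark_step()                 # cut the graph again after the op
    return out
```

Usage inside a model would look like `y = roll_on_cpu(x, 1, 1, mark_step=xm.mark_step)`. Each call adds device-to-host transfers and extra graph boundaries, which is the performance cost mentioned above.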

As for dynamic-slice support, we currently have no plans to add it in the near future.
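Since the dynamic-slice in this thread came from aten::roll, another option to try is writing the roll by hand with fixed-size slices and a concatenation. The sketch below assumes a single dimension and an integer shift known at trace time; the name static_roll is hypothetical, and whether this lowers without a dynamic-slice on neuronx-cc is an untested assumption.

```python
import torch


def static_roll(x: torch.Tensor, shift: int, dim: int) -> torch.Tensor:
    """Hypothetical helper: torch.roll along one dim using static slices."""
    n = x.size(dim)
    s = shift % n  # normalize negative and oversized shifts into [0, n)
    if s == 0:
        return x
    tail = x.narrow(dim, n - s, s)   # last s elements move to the front
    head = x.narrow(dim, 0, n - s)   # remaining elements follow
    return torch.cat([tail, head], dim=dim)
```

On CPU the result equals torch.roll(x, shift, dim), so it can be validated there before trying it on a Trainium device.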

ukus04 commented 12 months ago

Thank you. I'll test the code with extra mark_steps and wait for the issue to be resolved.

aws-taylor commented 9 months ago

Hello @ukus04,

We've released 2.12 (and 2.13 and 2.14) and believe this issue has been resolved. Please don't hesitate to re-open if you encounter further problems.

Regards, Taylor