Closed ukus04 closed 9 months ago
Hi ukus04,
With your simple code example of the one-hot op, I can reproduce the error; we'll debug it further and keep you updated. You're correct, it's not related to the UNIMPLEMENTED error. Currently one-hot is not fully supported due to this error.
Regarding the dynamic-slice UNIMPLEMENTED error at the beginning, could you try adding
export XLA_IR_DEBUG=1
export XLA_HLO_DEBUG=1
before running your script? It should provide trace info about where the dynamic-slice is introduced. That would help answer your third question.
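For a notebook workflow, a minimal sketch of setting these flags (they must be in the environment before torch_xla is imported; the torch_xla import is commented out here since it assumes a Neuron/XLA install):

```python
import os

# Assumed behavior: these debug flags make PyTorch/XLA record source
# metadata (file, line, op name) in the IR/HLO it emits, which is what
# surfaces the metadata={...} field in the compile error below.
os.environ["XLA_IR_DEBUG"] = "1"
os.environ["XLA_HLO_DEBUG"] = "1"

# Only import torch_xla AFTER the flags are set:
# import torch_xla.core.xla_model as xm
```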
I added the env variables with os.environ before the training code, and I got an additional metadata field in the error:
metadata={op_type="aten__roll" op_name="IPKernelApp[_instance]/AsyncIOMainLoop[io_loop]/_UnixSelectorEventLoop[asyncio_loop]/IPythonKernel_1/ZMQInteractiveShell_1/ZMQInteractiveShell_1/SOME_CLASS_1/SOME_MODEL[model]/aten__roll" source_file="~~~blabla.py~~~" source_line=558}
And I cannot find "aten::roll" in the operator support docs.
So, is the error caused by unsupported operators? Should they be processed on CPU? (Related to my second question)
-> In inference, if there are operators unsupported on Inferentia, they are processed on CPU. So I thought the same applied to Trainium, but maybe it does not.
Hi @ukus04,
We fixed a few issues in the latest release; the "INTERNAL: neuronx-cc compilation failed" issue triggered by the simple code example should be resolved in the coming release 2.12, please stay tuned.
It's correct that on Inferentia, unsupported ops can run on CPU. On Trainium, adding extra mark_steps before and after the unsupported ops and moving the input tensors to CPU may help in general. It changes the graph partition, and you can offload a portion of it to CPU (with some performance implications).
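The mark_step pattern described above might look like the following sketch, using torch.roll (reported as aten::roll in the metadata) as the unsupported op; the torch_xla import is guarded so the snippet also runs on a plain CPU install, and the exact placement of the cuts is an assumption:

```python
import torch

try:
    import torch_xla.core.xla_model as xm  # present on a Neuron/XLA host
except ImportError:
    xm = None  # plain CPU install: the mark_step calls become no-ops

def roll_on_cpu(x, shifts, dims):
    """Run torch.roll on CPU, cutting the XLA graph before and after it."""
    if xm is not None:
        xm.mark_step()  # flush the graph built so far
    out = torch.roll(x.cpu(), shifts=shifts, dims=dims)  # op runs on CPU
    out = out.to(x.device)  # move the result back to the original device
    if xm is not None:
        xm.mark_step()  # start a fresh graph after the CPU op
    return out
```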
As for dynamic-slice support, we currently have no plans to support it in the near future.
Thank you. I'll test the code with extra mark_steps and wait for the issue to be resolved.
Hello @ukus04,
We've released 2.12 (and 2.13 and 2.14) and believe this issue has been resolved. Please don't hesitate to re-open if you encounter further problems.
Regards, Taylor
I tried to train my model with torch-neuronx by editing my existing code, but at compile time some errors occur.
One of them happens when compiling the one-hot function in torch.
When I run my original code, an UNIMPLEMENTED error occurs. So I tested one-hot with simple code. Here's the code that I ran to test:
Here's the log
If the tensor has one dimension, the one-hot operation compiles successfully. But when I make a tensor with multiple dims, compilation fails with the log above. Even when I flatten the tensor, it fails with the same kind of error.
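The three shapes described above can be sketched with plain PyTorch on CPU, where all of them succeed; per the discussion, the failure appears only when neuronx-cc compiles the multi-dim cases. The tensor values here are illustrative, not the original repro:

```python
import torch
import torch.nn.functional as F

one_d = torch.tensor([0, 2, 1])
multi = torch.tensor([[0, 2], [1, 1]])

a = F.one_hot(one_d, num_classes=3)            # 1-D input: compiles fine
b = F.one_hot(multi, num_classes=3)            # multi-dim: fails at compile
c = F.one_hot(multi.flatten(), num_classes=3)  # flattened: also reported failing
```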
There are no UNIMPLEMENTED errors in the simple code, which is different from the result of running my code. So, I want to ask these things: