aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/
Other
464 stars 154 forks

Unable to port DISK UNET from Kornia to Inf2. Compilation taking hours with no signs of progress #1039

Open kandakji opened 2 days ago

kandakji commented 2 days ago

Hi,

I'm trying to port some models from Kornia. I was able to port NetVlad and LightGlue.

When it comes to DISK, the trace command from torch_neuronx fails with the error "Input tensor is not an XLA tensor: LazyFloatType", although I moved both the tensors and the model to the XLA device.

So I started experimenting with torch.jit.trace. The compiler runs, but it is stuck at this debug entry:

2024-11-20T19:01:27Z INFO 454853 [job.Frontend.0]: Executing: <site-packages>/neuronxcc/starfish/bin/hlo2penguin --input /tmp/ubuntu/neuroncc_compile_workdir/ab6ffbbd-bb47-4739-9ef3-fef030126a68/model.MODULE_16216335577045190367+11b4a2df.hlo_module.pb --out-dir ./ --output penguin.py --layers-per-module=1 --partition --coalesce-all-gathers=false --coalesce-reduce-scatters=false --coalesce-all-reduces=false --emit-tensor-level-dropout-ops --emit-tensor-level-rng-ops --expand-batch-norm-training --enable-native-kernel --native-kernel-auto-cast=matmult-to-bf16

fayyadd commented 1 hour ago

Thank you for reaching out! To help us investigate this issue, can you please share the Neuron package versions in your environment (the output of pip list | grep neuron) and the steps to reproduce the issue?
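
If the shell pipeline is inconvenient (for example, inside a notebook), the same version information can probably be collected from Python. This is only a minimal sketch using the standard library; the helper name neuron_package_versions is hypothetical and not part of the Neuron SDK, and it assumes the Neuron packages were installed with pip into the active environment.

```python
from importlib import metadata


def neuron_package_versions(keyword="neuron"):
    """Return {distribution name: version} for installed packages
    whose name contains `keyword` (case-insensitive).

    Hypothetical helper, roughly equivalent to `pip list | grep neuron`.
    """
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing or broken metadata.
        if name and keyword.lower() in name.lower():
            versions[name] = dist.version
    return versions


if __name__ == "__main__":
    # Print in a pip-freeze-like format that is easy to paste into the issue.
    for name, version in sorted(neuron_package_versions().items()):
        print(f"{name}=={version}")
```

Pasting the printed lines together with the exact trace call and input shapes should cover both parts of the request.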