aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Compiling SD1.5 for Neuron with resolution 512x768 fails #979

Open mludvig opened 2 months ago

mludvig commented 2 months ago

I'm trying to export SD 1.5 at a portrait resolution of 512x768 (width x height) for use with Neuron on Inferentia 2. This is my export command:

optimum-cli export neuron \
    --model jyoung105/stable-diffusion-v1-5 \
    --task stable-diffusion \
    --batch_size 1 --num_images_per_prompt 1 \
    --height 768 --width 512 \
    stable-diffusion-v1-5.neuron

The export works at 512x512 but fails at 512x768 with this error in the vae_encoder step:

***** Compiling vae_encoder *****
...........
[GCA035]  Instruction: I-5715-0 with opcode: TensorTensor couldn't be allocated in SB
Memory Location Accessed:
add.1_reload_7077_i0: 196608 Bytes per Partition and total of: 25165824 Bytes in SB
_add.1104-t7919_i0: 4 Bytes per Partition and total of: 512 Bytes in SB
add.6_i0: 2048 Bytes per Partition and total of: 262144 Bytes in SB
Total Accessed Bytes per partition by instruction: 198660
Total SB Partition Size: 196608
 - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
An error occured when trying to trace vae_encoder with the error message: neuronx-cc failed with 70.
The export is failed and vae_encoder neuron model won't be stored.

Do I need any other parameters, or is this a bug that needs fixing? I'm running the export on an AWS inf2.2xlarge instance.
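
For reference, the numbers in the GCA035 message can be checked with a short script. This is only a sketch of the arithmetic in the log above: the variable names are my own, and the partition count it derives is inferred from the ratios in the log rather than taken from Neuron documentation.

```python
# Per-partition byte counts copied from the GCA035 error message.
accessed_per_partition = {
    "add.1_reload_7077_i0": 196608,
    "_add.1104-t7919_i0": 4,
    "add.6_i0": 2048,
}

# "Total ... Bytes in SB" values reported for the same tensors.
reported_totals = {
    "add.1_reload_7077_i0": 25165824,
    "_add.1104-t7919_i0": 512,
    "add.6_i0": 262144,
}

# "Total SB Partition Size" from the error message.
sb_partition_size = 196608

# Bytes the instruction needs resident in one partition at the same time.
total_per_partition = sum(accessed_per_partition.values())

# How far the instruction exceeds the partition size.
overflow = total_per_partition - sb_partition_size

# Assumption: total bytes = per-partition bytes * number of partitions,
# so the ratio gives the partition count implied by the log.
implied_partitions = {
    name: reported_totals[name] // accessed_per_partition[name]
    for name in accessed_per_partition
}

print("total per partition:", total_per_partition)
print("overflow in bytes:", overflow)
print("implied partition counts:", implied_partitions)
```

The sum matches the 198660 bytes reported in the log, which is 2052 bytes over the partition size. The first operand (add.1_reload_7077_i0) alone is as large as a full partition, so the instruction cannot fit regardless of the two smaller operands.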