aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Segmentation Fault error on compiling Unet for sdxl #748

Closed danishboy000 closed 1 month ago

danishboy000 commented 1 year ago

I'm trying to compile SDXL for Neuron. After the UNet compiles, the process throws a segmentation fault.

This is the code I'm using

Instance: inf2.8xlarge (as recommended for compilation)
OS: Amazon Linux 2
Neuron details:

NeuronX Compiler version 2.10.0.34+6c8792c6f
Python version 3.8.16
HWM version 2.10.0.5-7b1976adf
NumPy version 1.21.6

Running on AMI ami-021db1e46943d7baa
Running in region use1-az5
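For anyone filing a similar report, the version details above can be gathered in one step. The following is a minimal sketch, not an official Neuron tool: the function name `collect_versions` and the package list are assumptions about a typical torch-neuronx environment and may need adjusting.

```python
import platform
from importlib import metadata

# Assumed package names for a typical torch-neuronx setup (hypothetical list);
# edit it to match what is actually installed.
PACKAGES = ["neuronx-cc", "torch-neuronx", "torch", "numpy"]


def collect_versions(packages=PACKAGES):
    """Return a dict mapping 'python' and each package name to a version string."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Missing packages are reported rather than raising an error.
            versions[name] = "not installed"
    return versions


# Example usage:
#   for name, version in collect_versions().items():
#       print(f"{name}: {version}")
```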

I have enough disk space as well.

aws-rhsoln commented 1 year ago

Thank you for reporting the issue. Can you share the code and the error log so we can take a deeper look?
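One way to capture a more useful error log for a crash like this is Python's standard `faulthandler` module, which prints the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. The sketch below assumes the compilation is driven by a Python script. The helper name `run_with_faulthandler` and the script name in the usage comment are hypothetical.

```python
import signal
import subprocess
import sys


def run_with_faulthandler(args):
    """Run a Python child process with faulthandler enabled.

    Returns (returncode, segfaulted, stderr_text).
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *args],
        capture_output=True,
        text=True,
    )
    # On POSIX, a child killed by a signal has returncode == -signal_number.
    segfaulted = proc.returncode == -signal.SIGSEGV
    return proc.returncode, segfaulted, proc.stderr


# Example usage ("compile_unet.py" is a placeholder for your own script):
#   code, segfaulted, stderr = run_with_faulthandler(["compile_unet.py"])
#   print("exit code:", code, "segfault:", segfaulted)
#   print(stderr[-2000:])  # tail of the log, including any traceback
```

Running the compile step in a child process also keeps a notebook kernel alive if the step crashes, so the captured stderr can still be inspected afterwards.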

danishboy000 commented 1 year ago

Hi, sorry, I forgot to share the link to the code: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/inference/hf_pretrained_sdxl_base_1024_inference.ipynb

This is the code I'm using, without any changes.

aws-qing commented 1 year ago

Hi @danishboy000 ,

Can you paste the full compile log? In particular, can you confirm that

  1. Compilation succeeds
  2. graph.neff is generated
  3. model.pt is not generated

Thank you
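The three checks above can be scripted. This is a minimal sketch under stated assumptions: `check_artifacts` is a hypothetical helper, and the paths in the usage comment are placeholders, since the real locations depend on the compiler work directory and on where the notebook saves the traced model.

```python
from pathlib import Path


def check_artifacts(neff_path, model_path):
    """Report whether the compiled NEFF and the saved model exist and are non-empty."""
    report = {}
    for label, raw in (("neff", neff_path), ("model", model_path)):
        path = Path(raw)
        exists = path.is_file()
        size = path.stat().st_size if exists else 0
        report[label] = {
            "exists": exists,
            "size_bytes": size,
            # A zero-byte file counts as a failed save.
            "ok": exists and size > 0,
        }
    return report


# Example usage (placeholder paths):
#   report = check_artifacts("/path/to/workdir/graph.neff", "/path/to/model.pt")
#   for label, info in report.items():
#       print(label, info)
```

A non-empty `graph.neff` next to a missing or zero-byte `model.pt` would suggest that compilation finished and the failure happened while saving the traced model.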

danishboy000 commented 1 year ago

Here is the last part of error trace:

2023-09-22T00:40:18Z Running lower_sync
2023-09-22T00:40:19Z lower_sync finished after 1.064 seconds
2023-09-22T00:40:19Z Running lower_act
2023-09-22T00:40:20Z lower_act finished after 0.691 seconds
2023-09-22T00:40:21Z Running lower_dve
2023-09-22T00:40:23Z lower_dve finished after 2.346 seconds
2023-09-22T00:40:23Z Running lower_ap
2023-09-22T00:40:24Z lower_ap finished after 0.503 seconds
2023-09-22T00:40:24Z Running alloc_regs
2023-09-22T00:40:24Z alloc_regs finished after 0.162 seconds
2023-09-22T00:40:25Z Running birverifier
2023-09-22T00:40:28Z birverifier finished after 3.600 seconds
2023-09-22T00:40:29Z Running codegen
2023-09-22T00:40:46Z codegen finished after 16.720 seconds
2023-09-22T00:40:46Z Running neff_packager
2023-09-22T00:40:47Z neff_packager finished after 0.498 seconds

2023-09-22T00:41:45Z Wrote /tmp/tmpyii81wzh/graph.neff
2023-09-22T00:41:48Z Compiler status PASS
Segmentation Fault

model.pt is generated, but its size is 0 KB.

aws-qing commented 1 year ago

Hi,

Thank you, we believe this is a known issue with saving large models and are looking into it.

danishboy000 commented 1 year ago

Thanks. Please let me know when you fix this, because I'm eager to deploy it.

aws-bhegedus commented 1 month ago

Closing this issue, as the script has been passing on recent Neuron releases.