aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

unet compile failed at hf_pretrained_sdxl_base_1024_inference #42

Open eric80116 opened 9 months ago

eric80116 commented 9 months ago

When I run "hf_pretrained_sdxl_base_1024_inference", the process fails at "torch.jit.save(unet_neuron, unet_filename)" and the kernel dies while saving the file.

aws-qing commented 9 months ago

Hi @eric80116,

Thank you for raising this issue. I would like to confirm a few things:

  1. Can you make sure you have enough space on your device? The model.pt file is around 3.9 GiB in size.
  2. Are you using an inf2.8xlarge, trn1, or larger instance? Note that you cannot compile and run SDXL models on inf2.xlarge.
  3. What is the compiler version you're using? We only support SDXL starting at release 2.13, which corresponds to compiler version 2.9. Can you make sure you have the latest compiler (from release 2.14)?
  4. Based on your description, it sounds like compilation succeeds. Is this correct? To confirm, check that you have Compiler status PASS in your log-neuron-cc.txt file. If not, can you paste the log file?
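The checks above can be sketched as a small script. This is only a rough sketch, not an official Neuron tool: the pip package names `neuronx-cc` and `torch-neuronx` are assumptions about a typical install, and the location of `log-neuron-cc.txt` depends on your compiler work directory, so pass in whatever path applies.

```python
import shutil
from importlib import metadata
from pathlib import Path

# Approximate size of the saved UNet model.pt mentioned above.
REQUIRED_FREE_GIB = 3.9


def free_gib(path="."):
    """Free disk space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def installed_version(package):
    """Installed version of a pip package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compiler_passed(log_path):
    """True if the compiler log contains the 'Compiler status PASS' line."""
    log = Path(log_path)
    if not log.is_file():
        return False
    return "Compiler status PASS" in log.read_text(errors="replace")


if __name__ == "__main__":
    print(f"free space: {free_gib():.1f} GiB (need about {REQUIRED_FREE_GIB})")
    # Package names are assumptions; adjust them to your environment.
    for pkg in ("neuronx-cc", "torch-neuronx"):
        print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
    # Hypothetical location; point this at your actual compiler log.
    print("compiler PASS:", compiler_passed("log-neuron-cc.txt"))
```

The instance type (item 2) is not covered here; it is easiest to confirm from the EC2 console.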

Thanks, Qing

eric80116 commented 9 months ago

Hi Qing (@aws-qing),

  1. Yes, I have enough space on the instance.

    Screenshot 2023-09-23 at 6 09 25 AM
  2. inf2.8xlarge on EC2.

  3. I used compiler version 2.9. I'm not sure how to check the Neuron SDK version, but the nccom-test version shows 2.14.6.

    Screenshot 2023-09-23 at 6 56 54 AM
  4. Yes, the log is attached (202309230700.log). It shows that the UNet compiled successfully at 2023-09-22T23:55:30Z, but saving the file then fails and kills the kernel. You can see the message "KernelRestarter: restarting kernel (1/5), keep random ports" in the log.

  5. I also checked /var/log/messages and found a Python segfault message from when the file was being saved. I hope it helps in tracing the root cause.

    Screenshot 2023-09-23 at 7 58 04 AM
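
Because the segfault kills the Jupyter kernel before any Python traceback is shown, one way to capture more detail is to run the failing step in a child interpreter with the standard-library `faulthandler` enabled. The sketch below is a generic helper under that assumption, not something from the Neuron samples; `run_isolated` is a hypothetical name, and the `code` string would need to contain the notebook's own trace and `torch.jit.save(unet_neuron, unet_filename)` steps.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=None):
    """Run `code` in a child interpreter with faulthandler enabled.

    Returns (returncode, signal_name, stderr). On POSIX a negative
    return code means the child was killed by that signal number.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    signal_name = None
    if proc.returncode < 0:
        signal_name = signal.Signals(-proc.returncode).name
    return proc.returncode, signal_name, proc.stderr


if __name__ == "__main__":
    # Stand-in for the real save step: deliberately raise SIGSEGV in the child.
    demo = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    returncode, signal_name, stderr = run_isolated(demo)
    print("return code:", returncode, "signal:", signal_name)
    print(stderr)  # faulthandler writes the Python-level traceback here
```

If the child dies with `SIGSEGV`, the captured stderr should include the Python frame that was executing at the time, which narrows down whether the crash happens inside the save call itself.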