unet compile failed at hf_pretrained_sdxl_base_1024_inference

eric80116 commented 9 months ago

When I run "hf_pretrained_sdxl_base_1024_inference", the process fails at "torch.jit.save(unet_neuron, unet_filename)" and the kernel dies while saving the file.

aws-qing commented 9 months ago

Hi @eric80116,

Thank you for raising this issue. I would like to confirm a few things:

1. Can you make sure you have enough space on your device? The model.pt file is around 3.9GiB in size.
2. Are you using an instance of inf2.8xlarge, trn1 or bigger? Note that you cannot compile and run SDXL models on inf2.xlarge.
3. What is the compiler version you're using? We only support SDXL starting at release 2.13, which corresponds to compiler version 2.9. Can you make sure you have the latest compiler (from release 2.14)?
4. Based on your description, it sounds like compilation succeeds. Is this correct? To confirm, check that you have Compiler status PASS in your log-neuron-cc.txt file. If not, can you paste the log file?

Thanks, Qing
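The disk-space and compiler-status checks above could be scripted roughly as follows. This is only a sketch, not part of the sample notebook: the helper names, the output directory ".", and the location of log-neuron-cc.txt are assumptions, and the 3.9 GiB figure is taken from the comment above.

```python
import os
import shutil

# Approximate size of the saved UNet (model.pt), per the comment above.
REQUIRED_BYTES = int(3.9 * 1024**3)


def has_enough_space(path, required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


def compiler_passed(log_text):
    """Return True if the compiler log text reports a successful compile."""
    return "Compiler status PASS" in log_text


if __name__ == "__main__":
    # Hypothetical locations: adjust to your compile and output directories.
    out_dir = "."
    log_path = "log-neuron-cc.txt"

    print("enough disk space:", has_enough_space(out_dir))
    if os.path.exists(log_path):
        with open(log_path, errors="replace") as f:
            print("compiler status PASS:", compiler_passed(f.read()))
    else:
        print("log file not found:", log_path)
```

Running this in the same environment as the notebook, before the save step, gives answers to the first and last questions without reading the log by hand.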

eric80116 commented 9 months ago

Hi Qing @aws-qing,

1. Yes, I have enough space on the instance.
2. I am using an inf2.8xlarge on EC2.
3. I used compiler version 2.9. I am not sure how to check the Neuron SDK version, but nccom-test reports version 2.14.6.
4. Yes, the log is attached. You can see that the UNet compiles successfully at 2023-09-22T23:55:30Z, but saving the file then fails and kills the kernel. The log contains the message "KernelRestarter: restarting kernel (1/5), keep random ports". 202309230700.log

I also checked /var/log/message and found a Python segfault message logged when saving the file. I hope it helps in tracing the root cause.

aws-neuron / aws-neuron-samples, issue #42