Open eric80116 opened 9 months ago
Hi @eric80116,
Thank you for raising this issue. I would like to confirm a few things:
Do you see "Compiler status PASS" in your log-neuron-cc.txt file? If not, can you paste the log file?
Thanks, Qing
Hi Qing @aws-qing,
Yes, I had enough space on the instance.
Inf2.8xlarge on EC2
I used compiler version 2.9. I'm not sure how to check the neuron-sdk version, but the nccom-test version shows 2.14.6.
Yes, the log is attached. You can see that the unet compiled successfully at 2023-09-22T23:55:30Z, but saving the file then failed and killed the kernel. You can see the message "KernelRestarter: restarting kernel (1/5), keep random ports" in the log. 202309230700.log
I also checked /var/log/message and found a python segfault message from the time the file was being saved. I hope it helps in tracing the root cause.
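On the open question of how to check the installed Neuron versions: one possible approach, sketched here as an assumption and not as an official Neuron SDK tool, is to list every installed Python package whose name contains "neuron". The helper name `neuron_package_versions` is hypothetical; it relies only on the standard-library `importlib.metadata` module and only sees packages installed in the current Python environment.

```python
from importlib import metadata


def neuron_package_versions(keyword="neuron"):
    """Return {package name: version} for installed packages matching keyword."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing metadata; match case-insensitively.
        if name and keyword in name.lower():
            found[name] = dist.version
    return dict(sorted(found.items()))


if __name__ == "__main__":
    # Print one "name==version" line per matching package.
    for name, version in neuron_package_versions().items():
        print(f"{name}=={version}")
```

Running this inside the same kernel or virtual environment that is used for compilation shows which compiler and framework packages are actually being loaded there.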
When I run "hf_pretrained_sdxl_base_1024_inference", the process fails at "torch.jit.save(unet_neuron, unet_filename)" and the kernel dies while saving the file.
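Because a notebook kernel restart hides the crash details, it can help to reproduce the failing save step outside Jupyter. The following is a minimal sketch under stated assumptions: the helper names `run_isolated` and `describe_exit` are hypothetical, and the script passed in would be your own file containing the compile and `torch.jit.save` steps. It runs that script in a child Python process with the standard `faulthandler` module enabled, so a segfault leaves a Python traceback in a log file.

```python
import signal
import subprocess
import sys


def run_isolated(script_path, log_path="save_crash.log"):
    """Run script_path in a child Python with faulthandler on; return its exit code."""
    with open(log_path, "w") as log:
        proc = subprocess.run(
            # "-X faulthandler" dumps the Python traceback on fatal signals.
            [sys.executable, "-X", "faulthandler", script_path],
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return proc.returncode


def describe_exit(returncode):
    """Turn a subprocess return code into a short description (POSIX semantics)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative code -N means the child was killed by signal N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"
```

For example, `describe_exit(run_isolated("save_unet.py"))`, where `save_unet.py` is a hypothetical script name, would report whether the save step was killed by a signal, and `save_crash.log` would hold any traceback that `faulthandler` produced.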