This pull request updates the notebook that demonstrates how to compile and run the HuggingFace Stable Diffusion 1.5 (512x512) model for accelerated inference on Neuron (located in torch-neuronx/inference/hf_pretrained_sd15_512_inference.ipynb).
The main purpose of this update is to provide more context on the challenges of compiling and running the SD 1.5 pipeline on Neuron. Legacy code has been refactored. Latency is comparable between the original and updated notebooks: ~2.4 s/image with num_inference_steps=50.
Main updates:
HF dependencies have been upgraded from diffusers==0.14.0 transformers==4.26.1 accelerate==0.16.0 to diffusers==0.20.0 transformers==4.31.0 accelerate==0.21.0.
An additional model is now compiled and run on Neuron: the safety checker's visual projection layer.
We make explicit that the compiled models support negative prompts, and provide examples in the section dedicated to running the pipeline.
A new section has been introduced before model compilation to log the inputs and outputs of the forward method of the models the user will compile and run on Neuron. This helps to understand the implementation of each wrapper function.
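A minimal sketch of what such an input/output logging step might look like (the helper `log_forward_io` and the stand-in `FakeTextEncoder` class are hypothetical, not the notebook's actual code):

```python
# Hypothetical sketch: wrap a module's forward method so each call logs
# its input and output shapes before writing the Neuron wrapper for it.
def log_forward_io(module, label):
    """Replace module.forward with a version that logs inputs/outputs."""
    original_forward = module.forward

    def logged_forward(*args, **kwargs):
        print(f"[{label}] inputs:",
              [getattr(a, "shape", type(a).__name__) for a in args])
        out = original_forward(*args, **kwargs)
        outs = out if isinstance(out, tuple) else (out,)
        print(f"[{label}] outputs:",
              [getattr(o, "shape", type(o).__name__) for o in outs])
        return out

    module.forward = logged_forward
    return module


class FakeTextEncoder:
    """Stand-in for a real text encoder: (batch, seq) -> (batch, seq, hidden)."""
    def forward(self, input_ids):
        return [[[0.0] * 8 for _ in row] for row in input_ids]


encoder = log_forward_io(FakeTextEncoder(), "text_encoder")
embeddings = encoder.forward([[1, 2, 3]])
```

In the real notebook the same idea would apply to the traced text encoder, U-Net, and safety models, logging tensor shapes instead of list types.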
Wrappers for the text encoder, U-Net, and safety models have been refactored into decorator functions, and compile-time and runtime wrappers/decorators are now explicitly distinguished.
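The compile-time vs. runtime split could be sketched roughly as follows (all names and the toy forward function are illustrative assumptions, not the notebook's actual wrappers):

```python
# Hypothetical sketch of the two wrapper roles:
# - the compile-time decorator exposes a purely positional, tuple-returning
#   callable, which is what tracing expects;
# - the runtime decorator restores the keyword interface the pipeline calls.
def compile_time_wrapper(forward):
    def traced_forward(sample, timestep, encoder_hidden_states):
        out = forward(sample, timestep, encoder_hidden_states)
        return (out,)  # flat tuple of outputs for tracing
    return traced_forward


def runtime_wrapper(traced):
    def forward(sample, timestep, encoder_hidden_states, return_dict=True):
        (out,) = traced(sample, timestep, encoder_hidden_states)
        return {"sample": out} if return_dict else (out,)
    return forward


# Toy stand-in for a U-Net forward pass.
def toy_unet(sample, timestep, encoder_hidden_states):
    return [x + timestep for x in sample]


traced = compile_time_wrapper(toy_unet)       # what would be handed to tracing
unet_forward = runtime_wrapper(traced)        # what the pipeline would call
result = unet_forward([1.0, 2.0], 0.5, None)  # {"sample": [1.5, 2.5]}
```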
Shapes of sample tensors used for compilation have been linked to the SD pipeline's configuration and to the generation configuration.
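As an illustration of the idea, the latent sample shape for tracing can be derived from configuration values rather than hard-coded; the numbers below mirror SD 1.5 defaults (4 latent channels, VAE scale factor 8), and the variable names are assumptions:

```python
# Hypothetical sketch: derive the U-Net sample shape used for compilation
# from the pipeline configuration and the generation configuration.
height, width = 512, 512             # generation configuration
vae_scale_factor = 8                 # from the VAE config (SD 1.5 default)
unet_in_channels = 4                 # unet.config.in_channels (SD 1.5 default)
batch_size = 1
do_classifier_free_guidance = True   # prompt + negative prompt in one batch

latent_batch = batch_size * (2 if do_classifier_free_guidance else 1)
latent_shape = (
    latent_batch,
    unet_in_channels,
    height // vae_scale_factor,
    width // vae_scale_factor,
)
# latent_shape == (2, 4, 64, 64)
```

Linking shapes this way keeps the traced models consistent if the resolution or batch settings change.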
Lazy & async loading have been enabled for all models; a warmup inference has therefore been added at the end of the model loading section.
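Why the warmup is needed can be sketched with a toy lazy-loading wrapper (the `LazyModel` class is hypothetical, purely to illustrate the pattern):

```python
# Hypothetical sketch: with lazy loading, the first call pays the
# model-load cost, so a warmup inference with a dummy input triggers the
# load before any real (or timed) request.
class LazyModel:
    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._model = None

    def __call__(self, x):
        if self._model is None:           # deferred until the first call
            self._model = self._load_fn()
        return self._model(x)


model = LazyModel(lambda: (lambda x: x * 2))
_ = model(0)         # warmup: forces the deferred load
result = model(21)   # subsequent calls skip loading entirely
```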
Attention monkey-patching has been modified to adapt to the removal of diffusers.models.cross_attention in version 0.20.0.
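One way to make the patch target version-tolerant is to resolve the attention class from a list of candidate locations; this helper is a hypothetical sketch, not the notebook's code (the two diffusers module paths reflect the pre- and post-0.20.0 layout):

```python
# Hypothetical sketch: resolve the class to monkey-patch by trying the
# new module path first and falling back to the old one.
import importlib


def resolve_attention_class(candidates):
    """Return the first importable class from (module, class) pairs, else None."""
    for module_name, class_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, class_name)
        except (ImportError, AttributeError):
            continue
    return None


attention_cls = resolve_attention_class([
    ("diffusers.models.attention_processor", "Attention"),   # diffusers >= 0.20.0
    ("diffusers.models.cross_attention", "CrossAttention"),  # diffusers < 0.20.0
])
# attention_cls is None when diffusers is not installed at all.
```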
A clean-up section has been added at the end of the notebook.
Notice: The get_attention_scores implementation we monkey-patch into the attention processor is critical: without the monkey-patch, latency increases dramatically. I was not able to fully determine on my own why the provided implementation yields such large performance gains. Given its importance, additional context on the chosen implementation would be greatly beneficial.
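For readers unfamiliar with the function, what get_attention_scores computes is the scaled query-key similarity followed by a softmax; the numpy sketch below is purely illustrative (it is not the Neuron-optimized implementation, and says nothing about why that implementation is faster):

```python
# Illustrative numpy sketch of the quantity get_attention_scores produces:
# softmax(scale * Q @ K^T). The actual patched implementation is what
# matters for Neuron latency; this only shows the math being computed.
import numpy as np


def attention_scores(query, key, scale):
    # (batch, q_len, d) @ (batch, d, k_len) -> (batch, q_len, k_len)
    scores = scale * np.matmul(query, key.transpose(0, 2, 1))
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)


q = np.ones((1, 2, 4))
k = np.ones((1, 3, 4))
probs = attention_scores(q, k, scale=4 ** -0.5)
# Identical keys -> uniform attention: each row sums to 1, entries 1/3
```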