While benchmarking the SDXL model on an A10, I found that nearly 25% of the time was spent computing fused GEGLU. Although the fused GEGLU kernel in stable-fast is already faster than unfused implementations, it still had room to improve.
So I tuned the kernel's thread block size and implemented a faster GELU function.
Before this optimization, speed on an A10 was less than 4 it/s at 1024x1024. After it, speed is 4.2 it/s.
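For reference, here is a minimal Python sketch of what GEGLU computes, using the tanh-based GELU approximation that fast kernels typically implement instead of the exact erf formulation. The function names are illustrative only and do not correspond to the actual CUDA kernel code:

```python
import math

def gelu_tanh(x):
    # Tanh-based GELU approximation: cheaper than the exact erf form,
    # since tanh maps to fast hardware instructions.
    # 0.7978845608... = sqrt(2 / pi)
    return 0.5 * x * (1.0 + math.tanh(0.7978845608028654 * (x + 0.044715 * x ** 3)))

def geglu(hidden, gate):
    # GEGLU: the fused projection output is split into two halves;
    # the "gate" half passes through GELU and elementwise-multiplies
    # the "hidden" half.
    return [h * gelu_tanh(g) for h, g in zip(hidden, gate)]
```

The fused kernel performs this split, activation, and multiply in a single pass, which is why its cost (and the choice of thread block size) shows up so prominently in the SDXL profile.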
python3 examples/optimize_stable_diffusion_pipeline.py --model stabilityai/stable-diffusion-xl-base-1.0 --height 1024 --width 1024 --seed 0
