chengzeyi / stable-fast

Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
MIT License

Triton group normalization is slower than torch ops in some case. #134

Open chenly15 opened 3 months ago

chenly15 commented 3 months ago

Among the listed optimization techniques, I saw that GroupNorm is supposed to be an effective speedup. But when I benchmarked the Triton and torch operations, I found that Triton was slower than torch. The script is the same as https://github.com/chengzeyi/stable-fast/blob/main/src/sfast/triton/ops/group_norm.py, and the result is shown below. For the GroupNorm operation, is Triton actually faster than torch? Can I replace torch.nn.GroupNorm with TritonGroupNorm directly to accelerate a Stable Diffusion model?
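For reference, this is what any GroupNorm implementation (torch, Triton, or otherwise) has to compute, so a drop-in replacement only needs to match it numerically. A minimal NumPy sketch (`group_norm_ref` is a hypothetical helper, not part of stable-fast): the channel dimension is split into `num_groups` groups, each group is normalized with its own mean and variance, then a per-channel affine transform is applied.

```python
import numpy as np

def group_norm_ref(x, num_groups, weight, bias, eps=1e-5):
    """NumPy reference for group normalization.

    x: (N, C, H, W) input; weight, bias: (C,) affine parameters.
    Matches the semantics of torch.nn.GroupNorm.
    """
    n, c, h, w = x.shape
    # Split channels into groups: (N, G, C//G, H, W)
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    # Per-group statistics over channel and spatial dims
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    y = g.reshape(n, c, h, w)
    # Per-channel affine transform
    return y * weight.reshape(1, c, 1, 1) + bias.reshape(1, c, 1, 1)
```

Swapping implementations is safe as long as outputs agree to floating-point tolerance with this reference; the open question in this issue is only about speed, not correctness.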

[Screenshot, 2024-03-12: benchmark timings comparing the Triton and torch GroupNorm operations]

My env is A100 GPU, torch 2.1, triton 2.1, no xformers, diffusers 0.21.2
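One thing worth checking before drawing conclusions from such numbers: CUDA kernel launches are asynchronous, so a benchmark that reads the wall clock without synchronizing mostly measures launch overhead, and small Triton kernels can look artificially slow or fast. A hedged sketch of a fair timing loop (`bench` and its parameters are illustrative, not from stable-fast); on GPU you would pass `sync=torch.cuda.synchronize`:

```python
import time

def bench(fn, warmup=10, iters=100, sync=None):
    """Average wall-clock time per call of fn, in seconds.

    warmup runs first so one-time costs (JIT compilation of the
    Triton kernel, cache warm-up) are excluded; sync, if given,
    is called before each clock read so asynchronous GPU work is
    actually finished when we measure (e.g. torch.cuda.synchronize).
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - t0) / iters
```

Triton also JIT-compiles and autotunes on the first calls for each input shape, so without a warmup phase the first measurement includes compilation time.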

chengzeyi commented 1 month ago

@chenly15 Our implementation may not be very efficient right now. I currently use other methods to speed up the GroupNorm computation, but they have not been open-sourced so far.