Among the listed techniques, I found GroupNorm was effective for speeding things up. But when I benchmarked the Triton and torch operations, the Triton one was slower than torch. The script is the same as https://github.com/chengzeyi/stable-fast/blob/main/src/sfast/triton/ops/group_norm.py , and the result is shown below.
For the GroupNorm operation, is Triton actually faster than torch? Can I replace torch.nn.GroupNorm with TritonGroupNorm directly to accelerate a Stable Diffusion model?
My env is
A100 GPU, torch 2.1, triton 2.1, no xformers, diffusers 0.21.2
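For reference, a minimal benchmarking sketch like the one below is how I compared the two ops. The input shape is an assumption chosen to resemble a Stable Diffusion U-Net activation; the Triton op would be timed the same way with `benchmark(triton_group_norm_fn, x)`.

```python
import time
import torch

def benchmark(fn, x, iters=50, warmup=10):
    """Average wall-clock time per call, in milliseconds."""
    for _ in range(warmup):
        fn(x)  # warm-up runs absorb any compilation/autotuning cost
    if x.is_cuda:
        torch.cuda.synchronize()  # make sure queued kernels finish before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

device = "cuda" if torch.cuda.is_available() else "cpu"
# Shape is an illustrative assumption (batch 2, 320 channels, 64x64 feature map).
x = torch.randn(2, 320, 64, 64, device=device)
gn = torch.nn.GroupNorm(num_groups=32, num_channels=320).to(device)

torch_ms = benchmark(gn, x)
print(f"torch.nn.GroupNorm: {torch_ms:.3f} ms/call")
```

Note that without the warm-up runs, Triton's first-call compilation time would be counted against it and skew the comparison.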
@chenly15 Our implementation may not be very efficient right now. I currently use other methods to speed up the GroupNorm computation, but they have not been open-sourced so far.