chengzeyi / stable-fast

Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
MIT License

Triton group normalization is slower than torch ops in some case. #134

Open chenly15 opened 3 months ago

chenly15 commented 3 months ago

Among the listed optimization techniques, I saw that GroupNorm is supposed to be an effective speedup. But when I benchmarked the Triton and torch operations, I found that Triton was slower than torch. The script is the same as https://github.com/chengzeyi/stable-fast/blob/main/src/sfast/triton/ops/group_norm.py, and the result is shown below. For the GroupNorm operation, is Triton actually faster than torch? Can I replace torch.nn.GroupNorm with TritonGroupNorm directly to accelerate a Stable Diffusion model?
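For reference, this is what any GroupNorm implementation (torch, Triton, or otherwise) has to compute, so a drop-in replacement only needs to match it numerically. A minimal NumPy sketch (`group_norm_ref` is a hypothetical helper, not part of stable-fast): the channel dimension is split into `num_groups` groups, each group is normalized with its own mean and variance, then a per-channel affine transform is applied.

```python
import numpy as np

def group_norm_ref(x, num_groups, weight, bias, eps=1e-5):
    """NumPy reference for group normalization.

    x: (N, C, H, W) input; weight, bias: (C,) affine parameters.
    Matches the semantics of torch.nn.GroupNorm.
    """
    n, c, h, w = x.shape
    # Split channels into groups: (N, G, C//G, H, W)
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    # Per-group statistics over channel and spatial dims
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    y = g.reshape(n, c, h, w)
    # Per-channel affine transform
    return y * weight.reshape(1, c, 1, 1) + bias.reshape(1, c, 1, 1)
```

Swapping implementations is safe as long as outputs agree to floating-point tolerance with this reference; the open question in this issue is only about speed, not correctness.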

[Screenshot, 2024-03-12: benchmark timings comparing the Triton and torch GroupNorm operations]

My env is A100 GPU, torch 2.1, triton 2.1, no xformers, diffusers 0.21.2
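One thing worth checking before drawing conclusions from such numbers: CUDA kernel launches are asynchronous, so a benchmark that reads the wall clock without synchronizing mostly measures launch overhead, and small Triton kernels can look artificially slow or fast. A hedged sketch of a fair timing loop (`bench` and its parameters are illustrative, not from stable-fast); on GPU you would pass `sync=torch.cuda.synchronize`:

```python
import time

def bench(fn, warmup=10, iters=100, sync=None):
    """Average wall-clock time per call of fn, in seconds.

    warmup runs first so one-time costs (JIT compilation of the
    Triton kernel, cache warm-up) are excluded; sync, if given,
    is called before each clock read so asynchronous GPU work is
    actually finished when we measure (e.g. torch.cuda.synchronize).
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - t0) / iters
```

Triton also JIT-compiles and autotunes on the first calls for each input shape, so without a warmup phase the first measurement includes compilation time.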

chengzeyi commented 1 month ago

@chenly15 Our implementation may not be very efficient right now. I currently use other methods to speed up the GroupNorm computation, but they have not been open-sourced so far.