While benchmarking the SDXL model on an A10, I found that nearly 25% of the time was spent computing fused GEGLU. Although the fused GEGLU kernel in stable-fast is already faster than unfused implementations, it still had room to improve.
So I tuned the kernel's thread block size and implemented a faster GELU function.
Before this optimization, speed on an A10 was less than 4 it/s at 1024x1024. After it, speed is 4.2 it/s.
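For reference, here is a minimal Python sketch of what GEGLU computes, using the tanh-based GELU approximation that fast kernels typically implement instead of the exact erf formulation. The function names are illustrative only and do not correspond to the actual CUDA kernel code:

```python
import math

def gelu_tanh(x):
    # Tanh-based GELU approximation: cheaper than the exact erf form,
    # since tanh maps to fast hardware instructions.
    # 0.7978845608... = sqrt(2 / pi)
    return 0.5 * x * (1.0 + math.tanh(0.7978845608028654 * (x + 0.044715 * x ** 3)))

def geglu(hidden, gate):
    # GEGLU: the fused projection output is split into two halves;
    # the "gate" half passes through GELU and elementwise-multiplies
    # the "hidden" half.
    return [h * gelu_tanh(g) for h, g in zip(hidden, gate)]
```

The fused kernel performs this split, activation, and multiply in a single pass, which is why its cost (and the choice of thread block size) shows up so prominently in the SDXL profile.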
python3 examples/optimize_stable_diffusion_pipeline.py --model stabilityai/stable-diffusion-xl-base-1.0 --height 1024 --width 1024 --seed 0
