chengzeyi / stable-fast

Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.

Is stable_fast faster or slower than Huggingface fast_diffusion? #93

Closed: ghost closed this issue 5 months ago

ghost commented 6 months ago

https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion They are also using `torch.compile` with `fullgraph=True` and some other optimizations. I imagine their compilation is much slower, but is their optimized pipeline faster or slower than stable-fast? Can we get bfloat16 and quantization/fusion support?
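For reference, the tutorial's recipe boils down to casting to bfloat16, switching to channels-last memory layout, and compiling the UNet with full-graph capture. A minimal sketch of that route (model id and prompt are illustrative, not from the tutorial verbatim):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load in bfloat16, as the fast_diffusion tutorial suggests.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
).to("cuda")

# channels_last helps convolution-heavy modules on NVIDIA GPUs.
pipe.unet.to(memory_format=torch.channels_last)

# Full-graph compile of the UNet; the first call pays a long warm-up,
# which is the slow compilation the question refers to.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)

image = pipe("a photo of an astronaut riding a horse").images[0]
```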

SuperSecureHuman commented 5 months ago

stable-fast is faster; it's actually closer to TensorRT speeds.

https://github.com/chengzeyi/stable-fast#performance-comparison

I believe quantization is already there (?) https://github.com/chengzeyi/stable-fast#model-quantization
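For comparison with the `torch.compile` route above, stable-fast wraps the whole pipeline in its own compiler rather than compiling the UNet alone. A minimal sketch following the README (module path and config flags as documented there; the xformers/Triton flags assume those extras are installed):

```python
import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

config = CompilationConfig.Default()
config.enable_xformers = True    # optional, requires xformers
config.enable_triton = True      # optional, requires triton
config.enable_cuda_graph = True  # capture into a CUDA graph for low overhead

# Compiles the pipeline in place of a per-module torch.compile;
# warm-up is much shorter than full-graph torch.compile.
pipe = compile(pipe, config)

image = pipe("a photo of a cat").images[0]
```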

chengzeyi commented 5 months ago

@dsingal0 stable-fast should be faster and deliver higher generation quality.

@SuperSecureHuman Quantization is partially supported in stable-fast, but unfortunately it is not really efficient in terms of speed. To make it efficient, some CUDA kernels must be carefully written.
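For context, the partial support referred to here is along the lines of PyTorch's dynamic quantization of Linear layers, which the README's model-quantization section points to. A hypothetical illustration (the helper name is made up; `torch.quantization.quantize_dynamic` is the real API):

```python
import torch

def quantize_unet_linear_layers(unet):
    # Dynamic quantization: weights are stored as int8, but activations are
    # quantized/dequantized on the fly at runtime. This saves memory, yet
    # without purpose-built fused CUDA kernels it adds overhead, which is
    # why it is "not really efficient in terms of speed".
    return torch.quantization.quantize_dynamic(
        unet, {torch.nn.Linear}, dtype=torch.qint8, inplace=True
    )

# Illustrative usage on an existing Diffusers pipeline:
# pipe.unet = quantize_unet_linear_layers(pipe.unet)
```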