chengzeyi / stable-fast

Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.

Perf regression on A100 in v1.0.0+torch2.1.2+cu121+xformers0.23.post1 vs. 0.0.13+torch2.0.0+cu121+xformers0.22patch7 #88

Open. jon-chuang opened this issue 9 months ago

jon-chuang commented 9 months ago

LCM latency: 18.5ms -> 25.0ms.

Same story with the v1.0.0+torch2.1.1+cu121+xformers0.23 nightly release, which is even worse (30ms).

When I use v0.0.13 with torch 2.1.2 and xformers 0.23.post1, I do not observe this issue. So the regression is in stable-fast itself, not in torch or xformers.

Perf is worse even than 0.0.13 without compiling vae.encode (20.5ms).

A similar regression is observed on H100.
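
For context on how latency numbers like these are typically collected, here is a minimal timing sketch, not the reporter's actual script. The model ID, prompt, step count, and the commented-out stable-fast compile call are assumptions (the compile import path follows the repo's v1.x README), not details taken from this thread:

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical LCM checkpoint for illustration; the model actually
# benchmarked in this thread is not stated.
pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16
).to("cuda")

# Optional stable-fast compilation (import path per the v1.x README;
# verify it against the installed stable-fast version):
# from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig
# pipe = compile(pipe, CompilationConfig.Default())

prompt = "a photo of an astronaut riding a horse"

# Warmup iterations so kernel compilation / CUDA graph capture
# happens before timing starts.
for _ in range(3):
    pipe(prompt, num_inference_steps=4, guidance_scale=8.0)

# Time with CUDA events so GPU work is measured, not just host dispatch.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
n = 10
torch.cuda.synchronize()
start.record()
for _ in range(n):
    pipe(prompt, num_inference_steps=4, guidance_scale=8.0)
end.record()
torch.cuda.synchronize()
print(f"mean latency per call: {start.elapsed_time(end) / n:.1f} ms")
```

Numbers from a script like this are only comparable across environments if the warmup count, step count, and batch size are held fixed.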

chengzeyi commented 9 months ago

This shouldn't happen. What's your script?

chengzeyi commented 9 months ago

When I run `python3 examples/optimize_lcm_lora.py`, I still see a significant speedup, so I don't know what's wrong.