jon-chuang opened this issue 10 months ago
Thanks. That's really cool, I am going to check that. In the last week I have improved stable-fast a lot, so this performance gain on A100 should hold. Benchmark results will be updated soon 😄.
@chengzeyi let me know if you need help matching the reported results. I made a few tweaks to `torch.compile` to improve the UNet inference and the overall pipeline.
😯 Could you show how you tweak `torch.compile` to improve its speed, and share your results?
Just some settings; it may not help that much.

```python
import torch

with torch._inductor.config.patch({'layout_optimization': True}):
    model = torch.compile(model)
    model(...)  # first call triggers compilation
```

For the pipeline I also compiled the encoder and decoder.
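A minimal sketch of what compiling the whole pipeline could look like, assuming a diffusers-style pipeline object with `unet`, `text_encoder`, and `vae` attributes (the function and attribute names here are illustrative, not from this thread):

```python
import torch

def compile_pipeline(pipe):
    """Compile the heavy submodules of a diffusers-style pipeline.

    `pipe` is assumed to expose `unet`, `text_encoder`, and `vae`
    attributes; adjust to the pipeline you actually use.
    """
    with torch._inductor.config.patch({"layout_optimization": True}):
        # Note: inductor compiles lazily on the first forward call,
        # so the config patch should also be active during the first
        # inference if you want it to take effect.
        pipe.unet = torch.compile(pipe.unet)
        pipe.text_encoder = torch.compile(pipe.text_encoder)
        pipe.vae.decode = torch.compile(pipe.vae.decode)
    return pipe
```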
@jon-chuang A100 40GB results are now updated. But some numbers are missing. Could you provide numbers for SD Controlnet and SDXL?
| Framework | SD 1.5 | SD 2.1 | SD 1.5 ControlNet | SD XL |
|---|---|---|---|---|
| Vanilla PyTorch (2.1.0+cu118) | 23.8 it/s | 23.8 it/s | 15.7 it/s | 10.0 it/s |
| torch.compile (2.1.0+cu118, NHWC UNet) | 37.7 it/s | 42.7 it/s | 24.7 it/s | 20.9 it/s |
| Stable Fast (with xformers & Triton) | 58.0 it/s | outdated | outdated | outdated |
Will work on it.
Ok, I think I have made a mistake: you are mentioning the speed of `torch.compile`, not stable-fast 😮. I haven't tried torch nightly yet. What is its speed with and without `torch.compile`, and how does it compare with TRT and stable-fast?
And I have found that setting `'layout_optimization': True` is not necessary; it is the default config of `torch.compile`.
Oh yes. But now I want to try this one, set to `False`: https://github.com/pytorch/pytorch/blob/937d616e825e70b8d786c1f514ae9cec9c8d4ee9/torch/_inductor/config.py#L223
@jon-chuang Accurate iterations per second can now be measured with CUDA Events:
https://github.com/chengzeyi/stable-fast/blob/06b70e63dd98ee3865cfcc9a9786066166701207/examples/optimize_stable_diffusion_pipeline.py#L119
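For reference, a measurement along these lines with `torch.cuda.Event` might look like the following sketch (the helper name and the `step_fn` callable are assumptions, not the repo's actual code):

```python
import torch

def iterations_per_second(step_fn, n_steps=50, warmup=5):
    """Time `step_fn` with CUDA events so asynchronous kernel
    launches don't skew the result (requires a CUDA device)."""
    for _ in range(warmup):  # let compilation / autotuning settle first
        step_fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_steps):
        step_fn()
    end.record()
    torch.cuda.synchronize()  # wait for every queued kernel to finish
    elapsed_s = start.elapsed_time(end) / 1000.0  # elapsed_time() is in ms
    return n_steps / elapsed_s
```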
58 iterations per second (with `torch.cuda.synchronize`).

Settings:
- Wall clock time for full pipeline: 862 ms (no `torch.cuda.synchronize`), 927 ms (with `torch.cuda.synchronize`)
- torch version: `torch==2.2.0.dev20231203+cu121`
Also, please update the wall-clock time and iterations/s for other GPUs on torch nightly. If extrapolation serves us right, it ought to be faster than TensorRT.
Outdated
```
2023-12-04 03:24:33.549 [stderr ] 100%|██████████| 50/50 [00:00<00:00, 79.00it/s]
```

79 iterations per second (naive measurement). Wall clock time for full pipeline: 862 ms.

torch eager and not channels first: wall-clock time 1763 ms (naive measurement).