chengzeyi / stable-fast

Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
MIT License

Update torch.compile benchmark on A100 40GB SDv1.5 for torch nightly #60

Open jon-chuang opened 10 months ago

jon-chuang commented 10 months ago
```
100%|██████████| 50/50 [00:00<00:00, 58.00it/s]
```

58 iterations per second (with torch.cuda.synchronize).


Settings:

- A100 40GB, fp16, batch size 1
- `height=512`
- `width=512`
- `num_inference_steps=50`
- `num_images_per_prompt=1`
Wall clock time for the full pipeline: 862ms (no torch.cuda.synchronize), 927ms (with torch.cuda.synchronize).
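For context, the difference between the two numbers comes down to where the timer stops. A minimal sketch of the synchronized measurement (the `pipe` object and the prompt are assumptions, not the actual benchmark script):

```python
import time
import torch

torch.cuda.synchronize()
t0 = time.time()
images = pipe(
    "a photo of an astronaut riding a horse",
    height=512, width=512,
    num_inference_steps=50, num_images_per_prompt=1,
).images
# Without this synchronize, the timer can stop while GPU kernels are still
# queued, which is why the unsynchronized wall clock time reads lower.
torch.cuda.synchronize()
print(f"wall clock: {(time.time() - t0) * 1000:.0f} ms")
```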

torch version: torch==2.2.0.dev20231203+cu121

Also, please update the wall clock time and iterations/s for other GPUs on torch nightly. If extrapolation serves us right, it ought to be faster than TensorRT.


Outdated:

```
2023-12-04 03:24:33.549 [stderr ] 100%|██████████| 50/50 [00:00<00:00, 79.00it/s]
```

79 iterations per second (naive measurement). Wall clock time for full pipeline: 862ms.

torch eager and not channels-first: wall clock time: 1763ms

```
100%|██████████| 50/50 [00:01<00:00, 29.55it/s]
```

(naive measurement)

chengzeyi commented 10 months ago

Thanks, that's really cool! I am going to check it. Over the last week I have improved stable-fast a lot, so this performance gain on A100 should be real. Benchmark results will be updated soon 😄.

jon-chuang commented 10 months ago

@chengzeyi let me know if you need help to match the reported results. I made a few tweaks to torch.compile to improve the UNet inference and the overall pipeline.

chengzeyi commented 10 months ago

> @chengzeyi let me know if you need help to match the reported results. I made a few tweaks to torch.compile to improve the UNet inference and the overall pipeline.

😯 So could you show how you tweaked torch.compile to improve its speed, and share your results?

jon-chuang commented 10 months ago

Just some settings. It may not help that much.

```python
import torch

# Enable Inductor's layout optimization pass (channels-last propagation).
with torch._inductor.config.patch({'layout_optimization': True}):
    model = torch.compile(...)
    model(...)
```

For the pipeline I also compiled the encoder and decoder.
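A minimal sketch of that setup on a Diffusers pipeline, assuming the standard `unet`, `text_encoder`, and `vae` attributes (which components were actually compiled here is my reading of the comment, not confirmed):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

with torch._inductor.config.patch({'layout_optimization': True}):
    # Compile the UNet plus the surrounding encoder/decoder components.
    pipe.unet = torch.compile(pipe.unet)
    pipe.text_encoder = torch.compile(pipe.text_encoder)
    pipe.vae.decode = torch.compile(pipe.vae.decode)
    # Compilation is triggered lazily on the first call.
    images = pipe("a photo of a cat", num_inference_steps=50).images
```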

chengzeyi commented 10 months ago

@jon-chuang A100 40GB results are now updated, but some numbers are missing. Could you provide numbers for SD ControlNet and SDXL?

| Framework | SD 1.5 | SD 2.1 | SD 1.5 ControlNet | SD XL |
| --- | --- | --- | --- | --- |
| Vanilla PyTorch (2.1.0+cu118) | 23.8 it/s | 23.8 it/s | 15.7 it/s | 10.0 it/s |
| torch.compile (2.1.0+cu118, NHWC UNet) | 37.7 it/s | 42.7 it/s | 24.7 it/s | 20.9 it/s |
| Stable Fast (with xformers & Triton) | 58.0 it/s | outdated | outdated | outdated |

jon-chuang commented 10 months ago

Will work on it.

chengzeyi commented 10 months ago

> Will work on it.

Ok, I think I made a mistake: you were reporting the speed of torch.compile, not stable-fast 😮. I haven't tried torch nightly yet. So what is its speed with and without torch.compile, and how does it compare with TRT and stable-fast?

chengzeyi commented 10 months ago

Also, I have found that setting `'layout_optimization': True` is not necessary; it is already the default in torch.compile.

jon-chuang commented 10 months ago

Oh yes. But now I want to try this one, set to False: https://github.com/pytorch/pytorch/blob/937d616e825e70b8d786c1f514ae9cec9c8d4ee9/torch/_inductor/config.py#L223

chengzeyi commented 9 months ago

@jon-chuang Accurate iterations per second can now be measured with CUDA events: https://github.com/chengzeyi/stable-fast/blob/06b70e63dd98ee3865cfcc9a9786066166701207/examples/optimize_stable_diffusion_pipeline.py#L119
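For reference, the general CUDA-event timing pattern looks like this (a minimal sketch, not the code in the linked example; `pipe` and the prompt are assumptions):

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
images = pipe("a photo of a cat", num_inference_steps=50).images
end.record()
torch.cuda.synchronize()  # wait until both events are recorded on the GPU

elapsed_ms = start.elapsed_time(end)  # GPU time between the events, in ms
print(f"{50 / (elapsed_ms / 1000.0):.2f} it/s")
```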