facebookresearch / DiT

Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

[Question Again] Why does DiT-XL/2 take 119 GFLOPs to generate 256x256 images? #99

Open zheweijushi opened 2 months ago

zheweijushi commented 2 months ago

According to Issue #67, it can be inferred that with B=1, a single forward pass takes approximately 119 GFLOPs (strictly speaking, this figure counts MACs, not FLOPs).

However, in the DiT sampling code, the batch appears to be doubled for classifier-free guidance: B additional copies with the null ("empty") class label are concatenated before the forward pass, so the attention (and every other layer) runs on 2B samples.

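For reference, the batch-doubling pattern looks roughly like the sketch below. This is modeled on the repo's sample.py and DiT.forward_with_cfg, not verbatim code; the variable names are illustrative, and 1000 is assumed to be the null-class index for DiT's 1000-class ImageNet setup.

```python
import torch

# Sketch of DiT-style classifier-free guidance batching (illustrative, not
# verbatim repo code). n conditional latents are duplicated and paired with
# n null-class labels, so each diffusion step runs on an effective batch of 2n.
n, latent_size = 1, 32
z = torch.randn(n, 4, latent_size, latent_size)   # latents for n images
y = torch.randint(0, 1000, (n,))                  # real class labels
z = torch.cat([z, z], dim=0)                      # duplicate the latents
y_null = torch.full((n,), 1000)                   # assumed null/"empty" class id
y = torch.cat([y, y_null], dim=0)                 # batch is now 2n

# Inside the model's CFG forward, conditional and unconditional predictions
# come from this single doubled-batch pass and are then combined, roughly:
#   cond_eps, uncond_eps = torch.split(eps, len(eps) // 2, dim=0)
#   eps = uncond_eps + cfg_scale * (cond_eps - uncond_eps)
```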

Therefore, when estimating the compute of the DiT blocks during sampling with classifier-free guidance, the effective batch size should be 2B rather than B.

So shouldn't generating a 256x256 image require 1 (B=1) × 2 (the added empty classes) × 119 ≈ 238 GFLOPs per forward pass?
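As a sanity check, this can be measured directly. Below is a minimal sketch using fvcore's FlopCountAnalysis (which counts MACs, consistent with the ~119 figure), assuming the DiT repo's models.py is importable and that class index 1000 is the null class:

```python
import torch
from fvcore.nn import FlopCountAnalysis
from models import DiT_XL_2  # assumes the DiT repo's models.py is on the path

latent = 32  # 256x256 images -> 32x32 latents after the 8x VAE downsample
model = DiT_XL_2(input_size=latent).eval()

# Conditional-only pass, B = 1: expected to land near the ~119 G figure.
x = torch.randn(1, 4, latent, latent)
t = torch.randint(0, 1000, (1,))
y = torch.randint(0, 1000, (1,))
print(FlopCountAnalysis(model, (x, t, y)).total() / 1e9)

# CFG-style pass: batch doubled with a null-class copy -> roughly 2x the cost.
x2 = torch.cat([x, x], dim=0)
t2 = torch.cat([t, t], dim=0)
y2 = torch.cat([y, torch.tensor([1000])], dim=0)  # 1000 assumed null class
print(FlopCountAnalysis(model, (x2, t2, y2)).total() / 1e9)
```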

Is there a problem with my understanding? I would appreciate an answer. Thank you very much!

zheweijushi commented 2 months ago

@wpeebles @s9xie @ictzyqq @void-main could you please take a look? Thank you very much!