facebookresearch / DiT

Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

[Question Again] Why does DiT-XL/2 take 119 GFLOPs to generate 256x256 images? #99

Open zheweijushi opened 2 months ago

zheweijushi commented 2 months ago

According to Issue #67, it can be inferred that with B=1, a single forward pass takes approximately 119 GFLOPs (strictly speaking, this figure counts MACs, not FLOPs).

However, in the DiT sampling code, the batch appears to be doubled for classifier-free guidance: B additional copies with the null ("empty") class label are concatenated before the forward pass, so the attention (and every other layer) runs on 2B samples.

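For reference, the batch-doubling pattern looks roughly like the sketch below. This is modeled on the repo's sample.py and DiT.forward_with_cfg, not verbatim code; the variable names are illustrative, and 1000 is assumed to be the null-class index for DiT's 1000-class ImageNet setup.

```python
import torch

# Sketch of DiT-style classifier-free guidance batching (illustrative, not
# verbatim repo code). n conditional latents are duplicated and paired with
# n null-class labels, so each diffusion step runs on an effective batch of 2n.
n, latent_size = 1, 32
z = torch.randn(n, 4, latent_size, latent_size)   # latents for n images
y = torch.randint(0, 1000, (n,))                  # real class labels
z = torch.cat([z, z], dim=0)                      # duplicate the latents
y_null = torch.full((n,), 1000)                   # assumed null/"empty" class id
y = torch.cat([y, y_null], dim=0)                 # batch is now 2n

# Inside the model's CFG forward, conditional and unconditional predictions
# come from this single doubled-batch pass and are then combined, roughly:
#   cond_eps, uncond_eps = torch.split(eps, len(eps) // 2, dim=0)
#   eps = uncond_eps + cfg_scale * (cond_eps - uncond_eps)
```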

Therefore, when estimating the compute of the DiT blocks during sampling with classifier-free guidance, the effective batch size should be 2B rather than B.

So shouldn't generating a 256x256 image require 1 (B=1) × 2 (the added empty classes) × 119 ≈ 238 GFLOPs per forward pass?
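As a sanity check, this can be measured directly. Below is a minimal sketch using fvcore's FlopCountAnalysis (which counts MACs, consistent with the ~119 figure), assuming the DiT repo's models.py is importable and that class index 1000 is the null class:

```python
import torch
from fvcore.nn import FlopCountAnalysis
from models import DiT_XL_2  # assumes the DiT repo's models.py is on the path

latent = 32  # 256x256 images -> 32x32 latents after the 8x VAE downsample
model = DiT_XL_2(input_size=latent).eval()

# Conditional-only pass, B = 1: expected to land near the ~119 G figure.
x = torch.randn(1, 4, latent, latent)
t = torch.randint(0, 1000, (1,))
y = torch.randint(0, 1000, (1,))
print(FlopCountAnalysis(model, (x, t, y)).total() / 1e9)

# CFG-style pass: batch doubled with a null-class copy -> roughly 2x the cost.
x2 = torch.cat([x, x], dim=0)
t2 = torch.cat([t, t], dim=0)
y2 = torch.cat([y, torch.tensor([1000])], dim=0)  # 1000 assumed null class
print(FlopCountAnalysis(model, (x2, t2, y2)).total() / 1e9)
```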

Is there a problem with my understanding? I would appreciate an answer. Thank you very much!

zheweijushi commented 2 months ago

@wpeebles @s9xie @ictzyqq @void-main could you please take a look? Thank you very much!