Anima-Lab / MaskDiT

Code for Fast Training of Diffusion Models with Masked Transformers
MIT License

Discrepancies in training speed ratios #18

Open arghavan-kpm opened 3 months ago

arghavan-kpm commented 3 months ago

Hi, in Fig. 4 of the paper (256x256 diffusion models), the training-speed ratio (steps/sec) of MaskDiT to DiT differs between batch sizes: MaskDiT is ~60% faster at bs=256 but ~73% faster at bs=1024. I also ran the code with bs=16 (256x256, single GPU) and got 1.24 steps/sec for MaskDiT vs. 1.19 for DiT (no decoder, mask_ratio=0%), i.e. only ~4% faster. I was wondering what the reason behind these discrepancies is. Thanks.
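(The percentages I'm quoting are just relative steps/sec; a trivial check on my own bs=16 numbers, for reference:)

```python
# Relative speedup from the steps/sec reported above (bs=16, single GPU).
maskdit, dit = 1.24, 1.19
print(f"{maskdit / dit - 1:.1%}")   # ~4.2%, vs. ~60% (bs=256) and ~73% (bs=1024) in Fig. 4
```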

devzhk commented 3 months ago

Hi,

We never tried it with batch size 16. I think that with a large batch size, the primary computational cost comes from the forward and backward passes through the transformer, which masked training effectively reduces. With a small batch size, that part of the cost is much smaller relative to the fixed per-step overhead, so the relative speedup shrinks. Also, we used the official DiT repo when reporting the numbers; for a consistent comparison, you may want to benchmark against the official DiT implementation.
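To see the effect concretely, here is a minimal timing sketch (not the MaskDiT training script; a generic `torch.nn.TransformerEncoder` stands in for the DiT backbone, and the sequence length, width, depth, and 50% mask ratio are assumed values) that compares steps/sec with the full token sequence versus a masked one at a small and a large batch size:

```python
# Toy timing sketch -- NOT the MaskDiT benchmark. A generic TransformerEncoder
# stands in for the DiT backbone; seq_len, width, and depth are assumed values.
import time

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
seq_len, dim, heads, depth = 256, 384, 6, 12   # scaled-down stand-in config (assumed)

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
    num_layers=depth,
).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)


def steps_per_sec(batch_size, keep_ratio, n_steps=20):
    """Time forward + backward + optimizer step for a (possibly shortened) token sequence."""
    n_tokens = int(seq_len * keep_ratio)        # masking = fewer tokens through the encoder
    x = torch.randn(batch_size, n_tokens, dim, device=device)
    for _ in range(3):                          # warmup
        opt.zero_grad()
        model(x).mean().backward()
        opt.step()
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(n_steps):
        opt.zero_grad()
        model(x).mean().backward()
        opt.step()
    if device == "cuda":
        torch.cuda.synchronize()
    return n_steps / (time.time() - t0)


for bs in (16, 256):                            # shrink these if memory-limited
    full = steps_per_sec(bs, keep_ratio=1.0)    # DiT-like: all tokens
    masked = steps_per_sec(bs, keep_ratio=0.5)  # MaskDiT-like: 50% of tokens dropped
    print(f"bs={bs:4d}: full {full:.2f} it/s, masked {masked:.2f} it/s, "
          f"speedup {(masked / full - 1) * 100:.0f}%")
```

On a typical GPU the relative gap between masked and full should be noticeably larger at bs=256 than at bs=16, since per-step overhead (kernel launches, optimizer update, Python, data handling in a real training loop) does not shrink with the token count and dominates when the batch is small.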