arghavan-kpm opened this issue 3 months ago
Hi,
In the paper (Fig. 4, 256x256 diffusion models), the ratio between the training speed (steps/sec) of MaskDiT and DiT differs with batch size: MaskDiT is ~60% faster at bs=256 but ~73% faster at bs=1024. I also tried the code with bs=16 (256x256, single GPU) and got 1.24 steps/sec for MaskDiT vs 1.19 steps/sec for DiT (no decoder, mask_ratio=0%), i.e. only about a 4% speedup. I was wondering what the reason behind these discrepancies is. Thanks.

Hi,
We never tried it with bs=16. I think that with a large batch size, the primary computational cost comes from the forward and backward passes, which masked training can effectively reduce. With a small batch size, this part of the cost is much smaller relative to the fixed per-step overhead, so the speedup shrinks. Also, we used the official DiT repo when reporting the numbers, so for consistency you may want to benchmark against the official DiT implementation.
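As a rough illustration (this is not the MaskDiT code; the layer width, token count, mask ratio, and batch sizes below are just assumptions for the sketch), you can time a single transformer layer with the full token sequence vs. half of it and see that dropping tokens helps much more once the batch is large enough for compute to dominate the per-step overhead:

```python
# Minimal sketch: steps/sec of one ViT-style block with and without
# 50% token dropping, at a small and a large batch size.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative sizes only (roughly DiT-XL-like width, 256 latent tokens).
block = nn.TransformerEncoderLayer(
    d_model=1152, nhead=16, dim_feedforward=4608, batch_first=True
).to(device)
opt = torch.optim.AdamW(block.parameters(), lr=1e-4)

def steps_per_sec(batch_size, num_tokens, n_steps=10):
    x = torch.randn(batch_size, num_tokens, 1152, device=device)
    # warmup so lazy CUDA init / autotuning doesn't skew the timing
    for _ in range(3):
        loss = block(x).pow(2).mean()  # dummy loss, just to drive backward
        loss.backward(); opt.step(); opt.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(n_steps):
        loss = block(x).pow(2).mean()
        loss.backward(); opt.step(); opt.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()
    return n_steps / (time.time() - t0)

for bs in (16, 256):
    full = steps_per_sec(bs, 256)    # all tokens processed
    masked = steps_per_sec(bs, 128)  # 50% of tokens kept, mimicking masked training
    print(f"bs={bs}: full {full:.2f} it/s, masked {masked:.2f} it/s, "
          f"speedup {masked / full:.2f}x")
```

At small batch sizes, kernel launch and optimizer overhead take a larger share of each step, so halving the tokens changes the step time much less than it does at bs=256 or bs=1024.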