This repo: Fast Training of Diffusion Models with Masked Transformers suggests using a masked transformer architecture for faster DiT training. They claim that:

Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model, using only around 30% of its original training time.
Do you think it is worth considering adapting their code to the existing Latte model?
We attempted to incorporate the long skip connections introduced by MDTv2 and the masking strategy introduced by MaskDiT during training on WebVid-10M. Although they did not yield noticeable improvements in sample quality, the most significant advantage was faster convergence.
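For concreteness, here is a minimal PyTorch sketch of the two ideas mentioned above: a long skip connection between early and late blocks, and MaskDiT-style token dropping during training. This is not Latte/MaskDiT/MDTv2 code; the class names (`ToyVideoBackbone`, `masked_forward`), the toy token shapes, and the 0.5 mask ratio are illustrative assumptions only.

```python
# Minimal sketch, not the actual Latte / MaskDiT / MDTv2 code: a toy DiT-style
# backbone over video patch tokens with (1) a long skip connection and
# (2) MaskDiT-style token dropping during training.
import torch
import torch.nn as nn


class ToyVideoBackbone(nn.Module):
    """DiT-like encoder/decoder halves joined by long skip connections (U-ViT/MDT style)."""

    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()

        def block():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
            )

        half = depth // 2
        self.enc = nn.ModuleList([block() for _ in range(half)])
        self.dec = nn.ModuleList([block() for _ in range(half)])
        # Long skip: concatenate early features into the later half, then project back.
        self.skip_proj = nn.Linear(2 * dim, dim)

    def forward(self, x):
        skips = []
        for blk in self.enc:
            x = blk(x)
            skips.append(x)
        for blk in self.dec:
            x = self.skip_proj(torch.cat([x, skips.pop()], dim=-1))
            x = blk(x)
        return x


def masked_forward(model, tokens, mask_ratio=0.5):
    """Run the backbone on a random subset of patch tokens (MaskDiT-style).

    tokens: (batch, num_patches, dim) latent patch tokens of a video clip.
    Dropping ~50% of tokens roughly halves attention/MLP cost per training step.
    """
    b, n, d = tokens.shape
    keep = max(1, int(n * (1.0 - mask_ratio)))
    ids = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :keep]
    visible = torch.gather(tokens, 1, ids.unsqueeze(-1).expand(-1, -1, d))
    # ids is returned so a lightweight decoder / loss can handle the masked positions.
    return model(visible), ids


# Toy usage: 8 frames of 8x8 latent patches -> 512 tokens, half of them processed.
model = ToyVideoBackbone()
tokens = torch.randn(2, 8 * 8 * 8, 256)
out, kept_ids = masked_forward(model, tokens, mask_ratio=0.5)
print(out.shape)  # torch.Size([2, 256, 256])
```

As far as I understand, MaskDiT also trains with an auxiliary reconstruction objective on the masked tokens and finishes with a short unmasked finetuning phase, so a faithful port to Latte would need more than the token dropping shown here.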