HL-hanlin / Ctrl-Adapter

Official implementation of Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
https://ctrl-adapter.github.io/
Apache License 2.0

Questions on DiT-based video generation backbones #16

Open gwang-kim opened 1 month ago

gwang-kim commented 1 month ago

Hi,

Thank you for the great work! Also, your analysis of the performance differences between DynamiCrafter and SVD backbones for Latte is very insightful.

I'd be interested to learn more about how the choice of Latte/Hotshot-XL backbone impacts the quality of video outputs. Additionally, could you provide an estimated timeline for releasing the pre-trained models and code for the DiT/Hotshot-XL-based Ctrl-Adapter?

Thank you in advance for your time.

HL-hanlin commented 1 month ago

Hi Bradley,

Thank you for your interest in our work!

That's a great question! To be honest, we've also made some effort to determine whether a DiT-based backbone (e.g., Latte) is better or worse than a U-Net-based backbone (e.g., Hotshot-XL). However, we haven't arrived at a convincing conclusion yet, as it's challenging to make a fair comparison between these two backbones due to differences in model architecture, training data, scheduler, etc.

After experimenting with four video generation backbones (Hotshot-XL, I2VGenXL, SVD, Latte), here are some observations we've made that might provide insights for future work:

  1. Different backbone models may fail or perform poorly in different cases. For example, we've observed that SVD is better at generating sliding motions and camera movements, and it can maintain good color-style consistency between the first frame and the generated frames. On the other hand, I2VGen-XL is better at generating more complex motions, but its generated frames usually show slight color distortion.
  2. T2V models (Hotshot-XL, Latte) are usually easier to control, while I2V models (I2VGen-XL, SVD) generally have better visual quality. This may be due to two reasons: Firstly, T2V models are typically trained with lower fps compared to I2V models. Secondly, I2V models usually incorporate information from the first frame into the following generated frames for consistency control, while T2V models do not have such constraints.
  3. If we only compare Hotshot-XL with Latte, the generated videos are usually better with Latte. However, once again, it's challenging for us to conclude that a DiT-based backbone is superior to a U-Net-based backbone due to the reasons mentioned above.
  4. We have conducted some initial experiments in our paper and found that matching the feature map sizes between U-Net and DiT blocks usually achieves the best spatial control and visual quality. Additionally, we found that inserting the feature maps into only a subset of DiT blocks in an interleaved manner is a good strategy to save computational cost (a rough sketch of this interleaved insertion is shown right after this list).
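
To make point 4 a bit more concrete, here is a minimal PyTorch sketch of the interleaved-insertion idea, not the actual Ctrl-Adapter implementation: adapter/control features are projected to the DiT hidden size and added to the hidden states only at every k-th transformer block. The names `DiTBlock`, `InterleavedControlDiT`, `insert_interval`, and `control_feature` are illustrative placeholders.

```python
# Minimal sketch (not the repo's actual code) of interleaved insertion of
# adapter/control features into a stack of DiT blocks.
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    """Stand-in for a real DiT transformer block (attention + MLP omitted)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))


class InterleavedControlDiT(nn.Module):
    def __init__(self, hidden_size: int, depth: int, control_dim: int,
                 insert_interval: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(DiTBlock(hidden_size) for _ in range(depth))
        self.insert_interval = insert_interval
        # One projection per insertion point maps the adapter feature
        # (e.g., a flattened U-Net/ControlNet feature map) to the DiT hidden size.
        self.control_proj = nn.ModuleDict({
            str(i): nn.Linear(control_dim, hidden_size)
            for i in range(depth) if i % insert_interval == 0
        })

    def forward(self, x: torch.Tensor, control_feature: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, hidden_size)
        # control_feature: (batch, num_tokens, control_dim), aligned to the same token grid
        for i, block in enumerate(self.blocks):
            if str(i) in self.control_proj:
                # Interleaved insertion: only every `insert_interval`-th block
                # receives the control residual, which saves compute compared
                # to fusing the adapter features into every block.
                x = x + self.control_proj[str(i)](control_feature)
            x = block(x)
        return x


model = InterleavedControlDiT(hidden_size=512, depth=8, control_dim=320, insert_interval=2)
tokens = torch.randn(2, 256, 512)   # latent video tokens
control = torch.randn(2, 256, 320)  # adapter features for the same tokens
out = model(tokens, control)
print(out.shape)  # torch.Size([2, 256, 512])
```

In this toy setup, raising `insert_interval` trades control strength for lower compute, which is the trade-off the interleaved strategy is meant to exploit.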

Feel free to contact me if you're interested in exploring further how we should compare U-Net and DiT-based backbones in a fair way.

Regarding the code, I plan to clean up and release the code for Latte, PixArt-Alpha, and the video editing task within 2 weeks. Thanks!

gwang-kim commented 1 month ago

Thank you for sharing your great insights! They are really helpful. Looking forward to seeing the new code.