gwang-kim opened 1 month ago
Hi Bradley,
Thank you for your interest in our work!
That's a great question! To be honest, we've also made some effort to determine whether a DiT-based backbone (e.g., Latte) is better or worse than a U-Net based backbone (e.g., HotshotXL). However, we haven't arrived at a convincing conclusion yet, as it's challenging to make a fair comparison between these two backbones due to differences in model architecture, training data, scheduler, etc.
After experimenting with four video generation backbones (Hotshot-XL, I2VGenXL, SVD, Latte), here are some observations we've made that might provide insights for future work:
Feel free to contact me if you're interested in exploring further how we should compare U-Net and DiT-based backbones in a fair way.
Regarding the code, I plan to clean up the code for Latte, PixArt-Alpha, and the video editing task, and release it within 2 weeks. Thanks!
Thank you for sharing your great insights! They are really helpful. Looking forward to seeing the new code.
Hi,
Thank you for the great work! Also, your analysis of the performance differences between DynamiCrafter and SVD backbones for Latte is very insightful.
I'd be interested to learn more about how the choice of the Latte or Hotshot-XL backbone impacts the quality of the video outputs. Additionally, could you provide an estimated timeline for releasing the pre-trained models and code for the DiT- and Hotshot-XL-based Ctrl-Adapter?
Thank you in advance for your time.