alimama-creative / M3DDM-Video-Outpainting

Official repo for Hierarchical Masked 3D Diffusion Model for Video Outpainting
https://fanfanda.github.io/M3DDM/
Apache License 2.0

example input videos #2

Open jimmyl02 opened 5 months ago

jimmyl02 commented 5 months ago

Hello,

Thanks for the amazing work! I was wondering if you have some sample videos we could use with the inference script to play around with. I haven't had the chance to read through the code yet, but I'm guessing the model was trained at a particular aspect ratio and resolution. I'd love to be able to recreate some of the examples from the landing page.

Thanks! Would appreciate any insights.

fanfanda commented 5 months ago

Our M3DDM can take input videos of any resolution and any target_ratio_list, but the output video's longest edge will be 256 pixels, because our training data was resized to 256x256. Also, target_ratio_list must differ from the input video's aspect ratio; otherwise there is no region to outpaint. You can download videos from the YouTube-VOS dataset and run our script with a 1:1 target_ratio_list to extend them. Below is an example created from the video 13006c4c7e.mp4 in YouTube-VOS, using a 1:1 target_ratio_list.

| Source Video | Output Result |
| --- | --- |
| (video: 13006c4c7e.mp4) | (video: outpainted 1:1 result) |
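
To make the sizing rule above concrete, here is a small illustrative sketch (a hypothetical helper, not the repo's actual inference code) of how an output canvas with longest edge 256 follows from an input size and a target aspect ratio:

```python
# Illustrative sketch (not the repo's code): given an input video size and a
# target aspect ratio, compute the outpainted canvas so that its longest
# edge is 256, matching the 256x256 training resolution described above.

def outpaint_canvas(in_w: int, in_h: int, target_ratio: float) -> tuple[int, int]:
    """target_ratio is width / height, e.g. 1.0 for a 1:1 output."""
    # Grow the shorter dimension until the canvas matches the target ratio.
    if in_w / in_h > target_ratio:
        canvas_w, canvas_h = in_w, round(in_w / target_ratio)
    else:
        canvas_w, canvas_h = round(in_h * target_ratio), in_h
    # Rescale so the longest edge of the output is 256.
    scale = 256 / max(canvas_w, canvas_h)
    return round(canvas_w * scale), round(canvas_h * scale)

# A 4:3 input outpainted to 1:1 keeps its width, gains vertical context,
# and everything lands on a 256x256 canvas.
print(outpaint_canvas(640, 480, 1.0))  # -> (256, 256)
```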
iGerman00 commented 5 months ago

It's a very interesting paper. With some additional work, such as masking in post to remove the really annoying flashing and temporal inconsistency, color correction to match the original high-resolution video, and an upscale with a model like Real-ESRGAN or Topaz's products, I was able to composite an extension of an iconic 4:3 video to 16:9. I used the pre-trained weights provided in the README. It took very long on a 3090, though: about 10 hours for this video.

https://github.com/alimama-creative/M3DDM-Video-Outpainting/assets/36676880/f64faede-6bc3-41a3-9523-abc96dac9b7e
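
For reference, the "masking in post" step can be as simple as pasting the original frames back over the center of the upscaled outpainted frames. This is a minimal sketch under my own assumptions (centered alignment, frames already extracted and upscaled), not iGerman00's exact pipeline:

```python
# Sketch (assumed workflow): keep only the model-generated borders by
# compositing each original high-resolution frame over the center of the
# corresponding upscaled outpainted frame.
import numpy as np

def composite_center(outpainted: np.ndarray, original: np.ndarray) -> np.ndarray:
    H, W = outpainted.shape[:2]
    h, w = original.shape[:2]
    y, x = (H - h) // 2, (W - w) // 2  # assumes the original sits centered
    result = outpainted.copy()
    result[y:y + h, x:x + w] = original
    return result
```

A feathered alpha mask at the seam, instead of this hard paste, would further hide any color or flicker mismatch at the boundary.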

fanfanda commented 5 months ago

Wow, this is a really nice result! Thanks for your contributions.

We are also continuing to optimize our video outpainting model; in the future we will support high-resolution output directly, without a separate super-resolution model. Regarding inference speed, you can currently remove the branch for global video frames manually, or use a stride list of [15, 1] instead of [15, 5, 1] for a speed-up, though be aware that this may cost some quality. Alternatively, you could apply existing distillation algorithms to speed up our model.
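
To illustrate why shortening the stride list helps, here is a rough sketch (my reading of the description above, not the repo's code) of how a coarse-to-fine stride list partitions the frames into generation passes; dropping the middle level removes one full pass at the cost of weaker temporal guidance:

```python
# Sketch (assumed mechanics): each level generates the frames at its stride,
# conditioned on frames already produced by coarser levels, so fewer levels
# means fewer denoising passes over the video.

def frames_per_level(num_frames: int, strides: list[int]) -> list[list[int]]:
    done: set[int] = set()
    levels = []
    for s in strides:
        idxs = [i for i in range(0, num_frames, s) if i not in done]
        done.update(idxs)
        levels.append(idxs)
    return levels

for strides in ([15, 5, 1], [15, 1]):
    levels = frames_per_level(60, strides)
    print(strides, "->", [len(lvl) for lvl in levels], "new frames per pass")
# [15, 5, 1] -> [4, 8, 48] new frames per pass
# [15, 1]    -> [4, 56]    new frames per pass
```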