alimama-creative / M3DDM-Video-Outpainting

Official repo for Hierarchical Masked 3D Diffusion Model for Video Outpainting
https://fanfanda.github.io/M3DDM/
Apache License 2.0

example input videos #2

jimmyl02 commented 10 months ago

Hello,

Thanks for the amazing work! I was wondering if you have some sample videos we could use with the inference script to play around with. I haven't had a chance to read through the code yet, but I'm guessing the model was trained at a certain aspect ratio and resolution. I'd love to be able to recreate some of the examples from the landing page.

Thanks! Would appreciate any insights.

fanfanda commented 10 months ago

Our M3DDM can take input videos of any resolution and any target aspect ratio (set via target_ratio_list), but the output video's longest edge is capped at 256 pixels, because our training dataset was resized to 256x256. Note that target_ratio_list must differ from the input video's aspect ratio; otherwise there is nothing to outpaint. You can download videos from the YouTube-VOS dataset and run our script with a 1:1 target_ratio_list to extend them. Below is an example created from the video 13006c4c7e.mp4 in YouTube-VOS, with a 1:1 target_ratio_list.

[Attachments: source video and outpainted 1:1 result]
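
To make the sizing rule concrete, here is a minimal sketch of how the output canvas follows from the input size and the target ratio. The helper below is hypothetical, not code from this repo, and the exact rounding is an assumption:

```python
import math

def outpaint_canvas_size(src_w, src_h, target_ratio, max_edge=256):
    # Hypothetical helper (not part of the repo): derive the output canvas
    # for a given target aspect ratio (width / height), with the longest
    # edge capped at 256, matching the training resolution noted above.
    src_ratio = src_w / src_h
    if math.isclose(src_ratio, target_ratio):
        raise ValueError("target ratio must differ from the source ratio")
    if target_ratio > src_ratio:
        # Wider target: keep the full source height and extend left/right.
        out_w, out_h = src_h * target_ratio, src_h
    else:
        # Taller target: keep the full source width and extend top/bottom.
        out_w, out_h = src_w, src_w / target_ratio
    scale = max_edge / max(out_w, out_h)  # cap the longest edge at 256
    return round(out_w * scale), round(out_h * scale)

# A 1280x720 (16:9) clip outpainted to 1:1 yields a 256x256 canvas:
print(outpaint_canvas_size(1280, 720, 1.0))  # (256, 256)
```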
iGerman00 commented 9 months ago

It's a very interesting paper. With some additional work, such as masking in post to remove the really annoying flashing and temporal inconsistency, color correction to match the original high-resolution video, and an upscale with a model like Real-ESRGAN or Topaz's products, I was able to composite an extension of an iconic 4:3 video to 16:9. I used the pre-trained weights provided in the README. It took very long on a 3090, though: about 10 hours for this video.

https://github.com/alimama-creative/M3DDM-Video-Outpainting/assets/36676880/f64faede-6bc3-41a3-9523-abc96dac9b7e
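
For anyone wanting to reproduce something similar, below is a simplified sketch of the compositing step. It is a rough illustration rather than the exact pipeline described above: the Lab-space mean/std transfer stands in for the color correction, and the Real-ESRGAN/Topaz upscale is assumed to have already been applied to the outpainted frames.

```python
import cv2
import numpy as np

def match_color(src, ref):
    # Per-channel mean/std transfer in Lab space so the outpainted frame
    # roughly matches the original footage's color statistics. This is one
    # basic approach among many, not necessarily the one used above.
    src_lab = cv2.cvtColor(src, cv2.COLOR_BGR2LAB).astype(np.float32)
    ref_lab = cv2.cvtColor(ref, cv2.COLOR_BGR2LAB).astype(np.float32)
    for c in range(3):
        s_mean, s_std = src_lab[..., c].mean(), src_lab[..., c].std() + 1e-6
        r_mean, r_std = ref_lab[..., c].mean(), ref_lab[..., c].std() + 1e-6
        src_lab[..., c] = (src_lab[..., c] - s_mean) / s_std * r_std + r_mean
    src_lab = np.clip(src_lab, 0, 255).astype(np.uint8)
    return cv2.cvtColor(src_lab, cv2.COLOR_LAB2BGR)

def composite_frame(original_43, outpainted_169):
    # Paste the original 4:3 frame over the center of the (already
    # upscaled) 16:9 outpainted frame, masking out the generated center.
    H, W = outpainted_169.shape[:2]
    corrected = match_color(outpainted_169, original_43)
    new_w = int(H * 4 / 3)  # original spans the full height, centered
    resized = cv2.resize(original_43, (new_w, H))
    x0 = (W - new_w) // 2
    out = corrected.copy()
    out[:, x0:x0 + new_w] = resized
    return out
```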

fanfanda commented 9 months ago

Wow, this is a really nice result! Thanks for your contributions.

We are also continuing to optimize our video outpainting model. In the future, we will support output for high-resolution videos without the need for subsequent super-resolution models. Regarding inference speed, currently you can manually remove the branch for global video frames, or use a stride of [15, 1] instead of [15, 5, 1] for a speed-up, though be aware that this might result in some loss of quality. Alternatively, you could use existing distillation algorithms to speed up our model.