Recently, video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. To deliver a coherent long video ("story-level"), it is desirable to have creative transition and prediction effects across different clips. This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos. Specifically, we propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. By providing the images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To conduct a comprehensive evaluation of this new generative task, we propose three assessment criteria for smooth and creative transition: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos. Project page: https://vchitect.github.io/SEINE-project/
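To make the random-mask conditioning concrete, below is a minimal PyTorch sketch of how a transition input could be assembled: only the last frame of the preceding shot and the first frame of the following shot are observed, and a binary mask marks which frames are given versus to be generated. The function name, tensor layout, and channel-concatenation scheme are illustrative assumptions, not SEINE's actual interface.

```python
import torch

def build_transition_condition(first_scene_end, second_scene_start, num_frames=16):
    """Assemble a masked-video conditioning tensor for a scene transition.

    A minimal sketch under assumed shapes: frames are (C, H, W) images,
    only the two endpoint frames are observed, and a binary mask channel
    indicates which frames the diffusion model should treat as given.
    """
    c, h, w = first_scene_end.shape
    frames = torch.zeros(num_frames, c, h, w)   # unobserved frames start as zeros
    mask = torch.zeros(num_frames, 1, h, w)     # 1 = observed frame, 0 = to be generated
    frames[0] = first_scene_end                 # last frame of the preceding shot
    frames[-1] = second_scene_start             # first frame of the following shot
    mask[0] = 1.0
    mask[-1] = 1.0
    # The model would receive the masked frames and mask concatenated along
    # the channel axis, alongside the noisy latents and the text embedding.
    return torch.cat([frames, mask], dim=1)     # (num_frames, C + 1, H, W)
```

Sampling the mask at random during training, rather than fixing it to the two endpoints, is what lets a single model also cover the other tasks the abstract mentions: image-to-video animation (only the first frame observed) and autoregressive video prediction (a prefix of frames observed).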