THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Improving quality of Image to Video generation with CogVideoX1.5-5B #568

Open AvisP opened 2 days ago

AvisP commented 2 days ago

I was wondering if there is any special way to improve the quality of videos generated from static images. I followed the suggestion in the example to refine the prompt with an LLM. For the static image shown below, I used a text prompt that I got from ChatGPT: "A dense canopy of vibrant green leaves gently sways in the breeze, their edges fluttering and rustling softly. The wind moves through the branches, causing the leaves to tremble and shimmer with a life of their own. As the wind picks up, the rustling grows louder, filling the air with a natural melody. The sunlight filters through the leaves, casting playful shadows on the ground below, while the occasional gust of wind causes a few leaves to break free, twirling and spiraling before landing softly."

[Image: Tree]

The resulting video I got with num_inference_steps at 20, guidance_scale at 6, and num_frames at 49 is the following:

https://github.com/user-attachments/assets/c11d12bf-339d-4f53-88ac-93fb0d4f136e

Can you provide some suggestions on how to improve the quality of the video? I can upscale the video to improve the resolution, but what I am interested in is improving the dynamics of the motion. Thanks!

zRzRzRzRzRzRzR commented 2 days ago

Please read the README carefully. I'm not sure whether you changed your frame count to 81, and I noticed that you only used 20 steps instead of the official 50 steps, which has a significant impact on the results. A resolution of 720 × 480 is acceptable.

zRzRzRzRzRzRzR commented 2 days ago

720 × 480 is suitable for CogVideoX1.5-5B-I2V; for T2V it should be 1360 × 768.
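
For reference, a minimal sketch of how one might select the resolution per model variant. The values are the ones stated in this thread; the dictionary and helper function are purely illustrative, not part of the library:

# Recommended (width, height) per model variant, as stated above.
RECOMMENDED_RESOLUTION = {
    "CogVideoX1.5-5B-I2V": (720, 480),   # image-to-video
    "CogVideoX1.5-5B": (1360, 768),      # text-to-video
}

def pick_resolution(model_name: str) -> tuple[int, int]:
    # Fall back to 720 x 480 if the model is not listed.
    return RECOMMENDED_RESOLUTION.get(model_name, (720, 480))

width, height = pick_resolution("CogVideoX1.5-5B-I2V")  # -> (720, 480)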

AvisP commented 1 day ago

Thanks for your comments. I kept the height at 480 and the width at 720, and this is what I got.

Guidance scale: 6.0, inference steps: 50

https://github.com/user-attachments/assets/6c657695-9a6b-461d-9336-248daa48ef93

Guidance scale: 6.0, inference steps: 20

https://github.com/user-attachments/assets/a9d115e5-34eb-4468-964c-ad472a15eaab

It seems to be roughly doing what is described in the prompt, changing the camera angle and zooming in, but I don't see a noticeable improvement from the higher number of inference steps. If I am missing something, please let me know. The code is below for reference:

import torch
from diffusers import CogVideoXDPMScheduler, CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video

# MODEL_PATH, prompt, image, and output_path are defined earlier.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

use_dynamic_cfg_flag = True
pipe.enable_sequential_cpu_offload()  # Offload submodules to CPU to reduce VRAM usage
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video_generate = pipe(
    height=480,
    width=720,
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,  # Number of videos to generate per prompt
    num_inference_steps=50,  # Number of inference steps
    num_frames=81,  # Number of frames to generate
    use_dynamic_cfg=use_dynamic_cfg_flag,  # Used with the DPM scheduler; for DDIM it should be False
    guidance_scale=6.0,
    generator=torch.Generator().manual_seed(42),  # Set the seed for reproducibility
).frames[0]

output_file_path = output_path / "output.mp4"  # output_path is a pathlib.Path
export_to_video(video_generate, str(output_file_path), fps=8)

zRzRzRzRzRzRzR commented 1 day ago

Change the frame rate from fps=8 to fps=16 in the export_to_video function.
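
For example, a minimal sketch, reusing video_generate and output_file_path from the code above:

export_to_video(video_generate, str(output_file_path), fps=16)  # export at 16 fps as recommended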

Nevertheless, with this scene and this prompt, the result is indeed not ideal, and this issue has been recorded as a bad case. Does the problem still occur when you replace the prompt?