THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Found black and white video when sampling CogVideoX1.5-5B 10s using code from huggingface. #578

Open DZY-irene opened 1 day ago

DZY-irene commented 1 day ago

System Info

None

Information

Reproduction

Thank you for your wonderful work! I'm generating samples with VBench's GPT-enhanced prompts, and I've noticed that occasionally a few videos come out entirely black or white. In other cases the video stays black for a long stretch and objects only appear in the last second.

prompt: A focused individual sits at a sleek, modern desk in a dimly lit room, illuminated by the soft glow of a high-resolution computer screen. They wear a cozy, oversized sweater and glasses, reflecting the screen's light. The room is filled with the quiet hum of technology, with a minimalist setup including a mechanical keyboard and a wireless mouse. The person’s fingers dance swiftly across the keys, their face showing intense concentration. Behind them, a bookshelf filled with colorful books and a potted plant adds a touch of warmth to the tech-centric space. The scene captures the blend of human focus and digital interaction.

https://github.com/user-attachments/assets/9e1a915f-87fa-456c-90aa-b1d40bb83ace

prompt: A pristine white bathroom features a sleek, modern sink with a chrome faucet, set against a backdrop of glossy white tiles. The sink's surface is adorned with a neatly folded hand towel and a small potted plant, adding a touch of greenery. Adjacent to the sink, a contemporary toilet with a soft-close lid and a minimalist design stands out. The toilet's clean lines and the subtle sheen of its ceramic surface reflect the ambient light. The scene captures the essence of a serene, well-maintained bathroom, emphasizing cleanliness and modern aesthetics.

https://github.com/user-attachments/assets/ad79cbc5-7c06-4016-82f9-50cf0e65a709

prompt: A sleek black cat with piercing green eyes prowls gracefully through a dimly lit, mysterious alleyway, its fur glistening under the soft glow of a distant streetlamp. The cat pauses, ears perked, as it senses movement, its silhouette casting an elongated shadow on the cobblestone path. It then leaps effortlessly onto a nearby windowsill, where it sits, tail flicking, and gazes intently into the darkness. The scene transitions to a close-up of the cat's face, highlighting its sharp, alert features and the subtle twitch of its whiskers, capturing the essence of its enigmatic and nocturnal nature.

https://github.com/user-attachments/assets/f2f15612-eca7-4b72-976e-38b4222c493b

Here is my code:

import os
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Short prompts, one per line; used only to name the output files.
prompt_file = 'test_human.txt'
with open(prompt_file, "r") as f:
    prompts = [line.strip() for line in f if line.strip()]

# GPT-enhanced (longer) prompts, one per line; these are passed to the pipeline.
prompt_file_longer = 'test_human_longer.txt'
with open(prompt_file_longer, "r") as f:
    prompts_longer = [line.strip() for line in f if line.strip()]

# The output directory is named after the short-prompt file, without its extension.
output_dir = prompt_file.split('/')[-1].split('.')[0]
os.makedirs(output_dir, exist_ok=True)

pipe = CogVideoXPipeline.from_pretrained(
    "PATH",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")
# Decode the VAE output in tiles and slices to reduce peak GPU memory.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

# Line i of the longer file is the enhanced version of line i of the short file.
for i, prompt_l in enumerate(prompts_longer):
    prompt = prompts[i]

    # Five samples per prompt, with seeds 42 to 46.
    for num in range(5):
        generator = torch.Generator(device="cuda").manual_seed(42 + num)
        video = pipe(
            prompt=prompt_l,
            height=768,
            width=1360,
            num_videos_per_prompt=1,
            num_inference_steps=50,
            num_frames=81,
            guidance_scale=6,
            generator=generator,
        ).frames[0]

        output_path = os.path.join(output_dir, f"{prompt}-{num}.mp4")
        export_to_video(video, output_path, fps=8)
        print(f"Video saved to {output_path}")

The two prompt files are attached: test_human.txt and test_human_longer.txt.

The prompts in these files are the ones that are likely to produce a black video.
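To find the affected samples without watching every clip, the frames can be checked before they are exported. The following is only a rough sketch: the function names and thresholds are hypothetical, and it assumes each frame converts to an H x W x C array through `np.asarray` (as PIL images and NumPy arrays do), with values either uint8 in [0, 255] or float in [0, 1].

```python
import numpy as np


def blank_frame_mask(frames, dark=0.05, bright=0.95, flat_std=0.02):
    """Return a boolean array, True where a frame is almost uniformly black or white.

    The thresholds are illustrative guesses, not values taken from CogVideoX.
    """
    mask = []
    for frame in frames:
        arr = np.asarray(frame)
        # Normalize uint8 frames to [0, 1]; float frames are assumed to be in [0, 1] already.
        scale = 255.0 if arr.dtype == np.uint8 else 1.0
        arr = arr.astype(np.float32) / scale
        mean, std = float(arr.mean()), float(arr.std())
        # A blank frame has almost no variation and sits near one end of the range.
        mask.append(std < flat_std and (mean < dark or mean > bright))
    return np.array(mask, dtype=bool)


def blank_fraction(frames, **thresholds):
    """Fraction of frames flagged as blank (0.0 for an empty clip)."""
    mask = blank_frame_mask(frames, **thresholds)
    return float(mask.mean()) if mask.size else 0.0
```

Calling `blank_fraction(video)` on the frames returned by the pipeline, before `export_to_video`, gives one number per seed. A value near 1.0 suggests a fully black or white clip, while a high value below 1.0 would match the case where objects only appear at the end.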

Expected behavior / 期待表现

I would like to figure out why this is happening.

zRzRzRzRzRzRzR commented 13 hours ago

Hmm, for a 10-second video, num_frames needs to be changed to 161, and the export call should be export_to_video(video, output_path, fps=16). Also, have you checked whether 5-second generation works normally?

DZY-irene commented 10 hours ago

I want to confirm the settings: for a 5-second video, num_frames=81 with export_to_video(video, output_path, fps=16), and for a 10-second video, num_frames=161 with export_to_video(video, output_path, fps=16). Is that right? I found that CogVideoX1.5 always uses a frame rate of 16 fps, but the Hugging Face demo sets fps=8 in export_to_video.
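The arithmetic behind these settings can be sketched as follows. This assumes the frame count follows fps × seconds + 1, which matches both pairs mentioned in this thread (81 and 161 frames at 16 fps) but has not been confirmed as a general rule. The function names are hypothetical.

```python
def frames_for_duration(seconds: float, fps: int = 16) -> int:
    """Frame count for a clip, assuming one initial frame plus `fps` frames per second."""
    return int(round(seconds * fps)) + 1


def playback_seconds(num_frames: int, fps: int) -> float:
    """Playback length of a clip under the same assumption."""
    return (num_frames - 1) / fps


for seconds in (5, 10):
    print(f"{seconds} s at 16 fps: num_frames={frames_for_duration(seconds)}")
```

Under the same assumption, `playback_seconds(81, 8)` is 10.0, so the original script (num_frames=81, fps=8) produces a clip that runs for about 10 seconds but contains only the frames of a 5-second clip, played at half speed.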

DZY-irene commented 10 hours ago

prompt: A focused individual sits at a sleek, modern desk in a dimly lit room, illuminated by the soft glow of a high-resolution computer screen. They wear a cozy, oversized sweater and glasses, reflecting the screen's light. The room is filled with the quiet hum of technology, with a minimalist setup including a mechanical keyboard and a wireless mouse. The person’s fingers dance swiftly across the keys, their face showing intense concentration. Behind them, a bookshelf filled with colorful books and a potted plant adds a touch of warmth to the tech-centric space. The scene captures the blend of human focus and digital interaction.

Setting 81 frames and 16fps for 5-sec video output:

https://github.com/user-attachments/assets/2ca4bf5a-5cc7-4c62-be11-b077f7018ec0

Setting 161 frames and 16fps for 10-sec video output:

https://github.com/user-attachments/assets/59f59a19-81f3-4c3e-9007-ccada6652d84

By the way, when I use the SAT version for 5-sec video sampling, everything works well and there are no black or white videos, so I suspect the problem is in the diffusers version.