THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0
8.4k stars · 803 forks

CogVideoX-5B may generate empty videos in some prompts #214

Closed whh258 closed 1 month ago

whh258 commented 2 months ago

System Info

CogVideoX-5B

Information

Reproduction

Thanks to the team for open-sourcing this work. The quality of the videos generated by CogVideoX-5B is really good!

However, when using the Hugging Face diffusers library with the default parameters, we encountered an issue where the generated video was empty, for example with prompt="Yellow curtains swaying near a blue sofa" or "Blue ink drops into water and disperses".

We don't know what caused this, but when we reduced the guidance scale parameter and weakened the text condition, the generated videos returned to normal. Can you provide an explanation, or a way to avoid this when running a large number of prompts?

Expected behavior

Provide an explanation or a solution to avoid empty videos when running a large number of prompts.
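As a stopgap when batch-running prompts, a cheap post-hoc check can flag near-empty clips so they can be retried with a different seed or a lower guidance scale. A minimal sketch; the threshold value and the NumPy frame layout are my assumptions, not part of the CogVideoX or diffusers API:

```python
import numpy as np

def is_empty_video(frames, threshold=8.0):
    # Stack the frames (H x W x C uint8 arrays, e.g. converted from
    # pipe(...).frames[0]) into one float array.
    arr = np.stack([np.asarray(f, dtype=np.float32) for f in frames])
    # A blank clip has both a tiny mean intensity and almost no variation.
    return bool(arr.mean() < threshold and arr.std() < threshold)

# Illustrative check: an all-black clip is flagged, a mid-gray one is not.
black = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(4)]
gray = [np.full((64, 64, 3), 128, dtype=np.uint8) for _ in range(4)]
print(is_empty_video(black), is_empty_video(gray))  # True False
```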

zRzRzRzRzRzRzR commented 2 months ago

There is a big problem: your prompt is too short. Please read our README carefully. The model needs long prompts as input, which requires you to use a large language model such as GPT-4 or GLM-4 to polish the prompt and feed in the long version. Otherwise, you are giving the model input it has not been trained on.

yunkchen commented 2 months ago

> There is a big problem: your prompt is too short. Please read our README carefully. The model needs long prompts as input, which requires you to use a large language model such as GPT-4 or GLM-4 to polish the prompt and feed in the long version. Otherwise, you are giving the model input it has not been trained on.

Our prompt: A person wears a white t-shirt and beige pants, holding a donut with pink icing and sprinkles. They bring the donut close to their mouth in several frames. The pink background contrasts with their white and beige clothing and the red-toned donut.

Using the demo code from the Hugging Face model page:

```python
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
```

tin2tin commented 2 months ago

I experienced the same problem with prompts like these:

bopan3 commented 1 month ago

I recommend increasing `num_inference_steps` to 100; this works in my case.