THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

The last few seconds of the video are static. #282

Open Yuheng-Feng opened 2 months ago

Yuheng-Feng commented 2 months ago

System Info

diffusers 0.30.2

Information

Reproduction

python cli_demo.py --prompt "The video depicts two individuals engaged in a conversation in what appears to be a professional or institutional setting. The person on the left is dressed in a white lab coat, suggesting they might be a medical professional, and is holding a smartphone. The person on the right is wearing a green jacket and has their hair tied back, carrying a shoulder bag, and appears to be listening attentively to the person in the lab coat. The background features a door with a sign indicating it leads to an executive conference room, reinforcing the professional context of the scene. The interaction between the two individuals seems to be focused and serious, as indicated by their body language and the setting."

Expected behavior

In some generated videos, the last few seconds are static. Why is that?

"caption": "The video depicts two individuals in a medical setting, likely a hospital. The person on the left is a bald man wearing a white lab coat over a dark blue shirt, with a name tag and a badge visible on his coat. He appears to be in mid-conversation or reacting to something, with a serious expression on his face. The person on the right is a woman with curly hair, wearing a white lab coat over a black top with a pattern. She is also looking in the direction of the man, with a concerned or attentive expression.\n\nThey are standing in front of a sign that reads \"4913 EXECUTIVE CONFERENCE ROOM,\" indicating they are near an executive meeting room within the hospital. The background shows a window with blinds partially closed, allowing some natural light to enter the room. The overall atmosphere suggests a professional and serious environment, possibly involving a discussion or an important event related to their work."

https://github.com/user-attachments/assets/a348cd0c-ce06-41f7-8afa-45a8d98ce0d6

 "caption": "The video depicts two individuals engaged in a conversation in what appears to be a professional or institutional setting. The person on the left is dressed in a white lab coat, suggesting they might be a medical professional, and is holding a smartphone. The person on the right is wearing a green jacket and has their hair tied back, carrying a shoulder bag, and appears to be listening attentively to the person in the lab coat. The background features a door with a sign indicating it leads to an executive conference room, reinforcing the professional context of the scene. The interaction between the two individuals seems to be focused and serious, as indicated by their body language and the setting."

https://github.com/user-attachments/assets/bdba2339-346a-4694-b779-fa6bd234b20e

zRzRzRzRzRzRzR commented 2 months ago

We have reproduced this issue. Some seeds currently produce normal results (for example, 84), but seed 42 does show the problem. We will take a look.

tin2tin commented 1 month ago

Often the first 2-3 frames are also frozen.
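
To compare seeds more objectively than by eye, one option is to count how many frame-to-frame transitions at the start and end of a clip are nearly unchanged. The following is only a rough sketch, not part of CogVideo: `mean_abs_diff`, `frozen_transitions`, and the `threshold` value are hypothetical names and settings chosen for illustration. It assumes the frames have already been decoded with a video library of your choice and that each frame has been flattened to a sequence of pixel values.

```python
from typing import List, Sequence, Tuple

# Assumption: one frame is a flat sequence of pixel values (e.g. 0-255).
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def frozen_transitions(frames: List[Frame], threshold: float = 1.0) -> Tuple[int, int]:
    """Count near-static transitions at the start and at the end of a clip.

    A transition (frame i -> frame i + 1) is treated as static when its mean
    absolute difference is below `threshold`. The threshold is a hypothetical
    default and would need tuning for real, compressed video.
    """
    diffs = [mean_abs_diff(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]

    # Consecutive static transitions counted from the first frame.
    leading = 0
    for d in diffs:
        if d >= threshold:
            break
        leading += 1

    # Consecutive static transitions counted backwards from the last frame.
    trailing = 0
    for d in reversed(diffs):
        if d >= threshold:
            break
        trailing += 1

    return leading, trailing
```

Running this on the outputs of two seeds (for example 42 and 84, as mentioned above) and comparing the trailing counts would give a simple number to attach to a report. If every transition in the clip is static, both counts equal the total number of transitions.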