THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0
9.19k stars, 862 forks

THUDM/CogVideoX-5b-I2V is producing gibberish output #517

Open nitinmukesh opened 3 hours ago

nitinmukesh commented 3 hours ago

System Info

diffusers 0.32.0.dev0
torch 2.5.1+cu121
torchvision 0.20.1+cu121
python 3.11

Information

Reproduction

python inference/cli_demo.py --prompt "A young girl with sun-kissed hair and sparkling blue eyes stands in a lush, sunlit garden, her face radiant with a genuine smile that lights up her entire being. She wears a soft, floral dress that complements the vibrant blooms around her. As she tilts her head slightly, the sunlight catches the gentle curve of her smile, highlighting her joyful expression. Her hands are gently clasped in front of her, adding to the serene and happy atmosphere. The background is a tapestry of colorful flowers and greenery, enhancing the warmth and beauty of her smile." --model_path THUDM/CogVideoX-5b-I2V --generate_type "i2v" --num_frames 48 --image_or_video_path image.png --width 720 --height 480

https://github.com/user-attachments/assets/c7cd2013-41dd-4c95-9d94-ded741308e21

Expected behavior

I tried the same image in the hosted Space and the output was fine:

https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space

a-r-r-o-w commented 3 hours ago

num_frames must be 81 or 161. I see it mentioned in the docs here: https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox.

It looks like the table rendering there is broken, so I will fix that in a follow-up PR.
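
For anyone hitting the same thing, a small pre-flight check before launching a long generation can catch an unsupported frame count early. This is only a sketch: `ALLOWED_NUM_FRAMES` and `check_num_frames` are hypothetical names, and the allowed values are copied from the reply above rather than from a verified specification, so confirm them against the linked docs for the checkpoint you are using.

```python
# Hypothetical helper, not part of CogVideo or diffusers.
# ASSUMPTION: the allowed values are the ones quoted in the reply above;
# verify them against the linked documentation for your checkpoint.
ALLOWED_NUM_FRAMES = (81, 161)


def check_num_frames(num_frames, allowed=ALLOWED_NUM_FRAMES):
    """Return num_frames unchanged if it is allowed, otherwise raise ValueError."""
    if num_frames in allowed:
        return num_frames
    # Suggest the closest allowed value to make the fix obvious.
    nearest = min(allowed, key=lambda n: abs(n - num_frames))
    raise ValueError(
        f"num_frames={num_frames} is not supported; "
        f"expected one of {allowed} (nearest: {nearest})"
    )


if __name__ == "__main__":
    try:
        check_num_frames(48)  # the value used in the report above
    except ValueError as err:
        print(err)
```

Calling `check_num_frames` on the value you intend to pass as `--num_frames` makes a bad frame count fail immediately with a readable message instead of producing a broken video.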