THUDM / CogVideo

Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

SAT sampling results are worse than Diffusers sampling results #201

Status: Closed (xvjiarui closed this issue 1 week ago)

xvjiarui commented 2 weeks ago

System Info

Hi CogVideo Team,

First of all, thank you so much for open-sourcing such great models for the community to research text-to-video generative models.

I tried out both the Diffusers and SAT codebases, and I found that the sampling results from SAT are much worse than those from Diffusers. Here are some examples:

  1. Prompt: "A white and orange tabby cat is seen happily darting through a dense garden, as if chasing something. Its eyes are wide and happy as it jogs forward, scanning the branches, flowers, and leaves as it walks. The path is narrow as it makes its way between all the plants. the scene is captured from a ground-level angle, following the cat closely, giving a low and intimate perspective. The image is cinematic with warm tones and a grainy texture. The scattered daylight between the leaves and plants above creates a warm contrast, accentuating the cat’s orange fur. The shot is clear and sharp, with a shallow depth of field."

Diffusers:

https://github.com/user-attachments/assets/7f6db0ee-db6a-4f28-9dd0-288f41d61a43

SAT:

https://github.com/user-attachments/assets/4526c32d-9a65-4c50-ae86-9aace94cb4a2

  2. Prompt: "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about."

Diffusers:

https://github.com/user-attachments/assets/46788cfd-7986-4d26-b7ae-5b06929b98ed

SAT:

https://github.com/user-attachments/assets/561e52e1-b1e5-49e4-8115-7cda950d9a3b

It would be very kind of the authors to look into this issue, as it would help the research community build exciting projects on top of CogVideoX. I truly appreciate your help and look forward to your reply.

Best, Jiarui

Information

Reproduction

Run the provided inference files.

Diffusers: https://github.com/THUDM/CogVideo/blob/main/inference/cli_demo.py

SAT: https://github.com/THUDM/CogVideo/blob/main/sat/inference.sh

Expected behavior

The SAT results are expected to be at the same level of quality as the Diffusers results.

tengjiayan20 commented 2 weeks ago

[image]

You need to use bf16 for the 5B model.
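The thread does not explain why bf16 matters here. One general property that often motivates this kind of recommendation is that bf16 keeps the exponent range of fp32, while fp16 overflows above 65504. The sketch below illustrates only that numeric difference using the Python standard library; it is not taken from the SAT or Diffusers code, and the helper names `to_fp16` and `to_bf16` are hypothetical. The bf16 conversion truncates instead of rounding to nearest, which is a simplification.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (fp16); values out of range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Approximate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Real hardware usually rounds to nearest; truncation is used here for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    # A large value: fp16 overflows, bf16 stays finite but coarse.
    print(to_fp16(70000.0), to_bf16(70000.0))
    # A value near 1: fp16 keeps more mantissa bits than bf16.
    print(to_fp16(1.001), to_bf16(1.001))
```

The trade-off is visible in both directions: bf16 tolerates large values at the cost of precision, while fp16 is more precise near 1 but has a much smaller range.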

xvjiarui commented 2 weeks ago

Thank you so much for your quick reply. I just fixed that, and the quality seems significantly improved, but it is still a little worse than Diffusers. Do you have any idea why? Or is it just due to random sampling?

https://github.com/user-attachments/assets/785ae31d-41bb-47f9-bd2c-e67755105baf

https://github.com/user-attachments/assets/b351a8b8-28a3-4a77-8de8-f79dd0ea88bc

zRzRzRzRzRzRzR commented 2 weeks ago

Make sure to compare the original videos generated by Diffusers, not the ones enhanced with super-resolution and interpolation.

If everything has been aligned, then the difference is likely due to randomness. The Diffusers model was migrated from this model’s weights without any additional training.
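Diffusion sampling starts from random Gaussian noise, so two runs with different seeds can differ noticeably in quality even with identical weights. The following is a minimal sketch of that idea using only the standard library. The function `initial_noise` is a hypothetical stand-in for the starting latents, not code from either codebase. It also assumes both runs draw noise from the same generator, which may not hold across SAT and Diffusers, so matching seeds there need not give matching videos.

```python
import random


def initial_noise(seed: int, n: int = 8) -> list[float]:
    """Draw n standard Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    # Same seed: identical starting noise, so a deterministic sampler would repeat.
    print(initial_noise(42) == initial_noise(42))
    # Different seed: different starting noise, so the output can differ.
    print(initial_noise(42) == initial_noise(43))
```

Because of this, comparing several seeds per codebase gives a fairer picture than comparing one sample from each.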

xvjiarui commented 1 week ago

Thank you so much! I have checked that everything is aligned, so it may be due to randomness.