Closed: xvjiarui closed this issue 1 week ago
You need to use bf16 for the 5B model.
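For readers unfamiliar with the two 16-bit formats, here is a minimal pure-Python sketch of how they differ. The thread does not say why the 5B model needs bf16, so treat the link to this issue as an assumption: the sketch only shows the general trade-off that fp16 has a small range (largest finite value 65504) while bf16 keeps float32's exponent range at lower precision. The helper names `round_trip_fp16` and `round_trip_bf16` are hypothetical, and the bf16 helper truncates rather than rounding to nearest.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def round_trip_fp16(x: float) -> float:
    """Store x as IEEE half precision and read it back (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range; model that as inf.
        return float("inf") if x > 0 else float("-inf")


def round_trip_bf16(x: float) -> float:
    """Simulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    This truncates the mantissa instead of rounding to nearest, which is a
    simplification made for this illustration.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    for value in (1.001, 70000.0):
        print(value, round_trip_fp16(value), round_trip_bf16(value))
```

In this sketch 70000.0 overflows to infinity in fp16 but stays finite in bf16 (69632.0 after truncation), while 1.001 is kept more precisely in fp16 than in bf16.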
Thank you so much for your quick reply. I just fixed that, and the quality seems significantly improved, but it is still slightly worse than Diffusers. Do you have any idea why? Or is it just due to random sampling?
https://github.com/user-attachments/assets/785ae31d-41bb-47f9-bd2c-e67755105baf
https://github.com/user-attachments/assets/b351a8b8-28a3-4a77-8de8-f79dd0ea88bc
Make sure to compare the original videos generated by Diffusers, not the ones enhanced with super-resolution and interpolation.
If everything has been aligned, then the difference is likely due to randomness. The Diffusers model was migrated from this model’s weights without any additional training.
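One way to check that "everything has been aligned" is to write down the settings of both runs and diff them. The sketch below is only an illustration: the helper `diff_settings`, the setting names, and all values are hypothetical placeholders, not the actual defaults of either codebase.

```python
MISSING = "<missing>"  # marker for a setting present in only one run


def diff_settings(a: dict, b: dict) -> dict:
    """Return {name: (value_in_a, value_in_b)} for every setting that differs."""
    differences = {}
    for name in sorted(set(a) | set(b)):
        value_a = a.get(name, MISSING)
        value_b = b.get(name, MISSING)
        if value_a != value_b:
            differences[name] = (value_a, value_b)
    return differences


if __name__ == "__main__":
    # Placeholder values for illustration only.
    diffusers_run = {
        "dtype": "bf16",
        "num_inference_steps": 50,
        "guidance_scale": 6.0,
        "post_processing": "none",
    }
    sat_run = {
        "dtype": "fp16",
        "num_inference_steps": 50,
        "guidance_scale": 6.0,
        "post_processing": "none",
    }
    print(diff_settings(diffusers_run, sat_run))
```

An empty result means the listed settings match, in which case any remaining difference between samples is more plausibly down to random sampling.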
Thank you so much! I have checked that everything is aligned, so the difference may be due to randomness.
System Info
Hi CogVideo Team,
First of all, thank you so much for open-sourcing such great models for the community to research text-to-video generative models.
I tried out both the Diffusers and SAT codebases, and I found that the sampling results from SAT are much worse than those from Diffusers. Here are some examples:
Diffusers:
https://github.com/user-attachments/assets/7f6db0ee-db6a-4f28-9dd0-288f41d61a43
SAT:
https://github.com/user-attachments/assets/4526c32d-9a65-4c50-ae86-9aace94cb4a2
Diffusers:
https://github.com/user-attachments/assets/46788cfd-7986-4d26-b7ae-5b06929b98ed
SAT:
https://github.com/user-attachments/assets/561e52e1-b1e5-49e4-8115-7cda950d9a3b
It would be very kind of the authors to look into this issue. It would help the research community build exciting projects on top of CogVideoX. I truly appreciate your help and look forward to your reply.
Best, Jiarui
Information
Reproduction
Run the provided inference files:
Diffusers: https://github.com/THUDM/CogVideo/blob/main/inference/cli_demo.py
SAT: https://github.com/THUDM/CogVideo/blob/main/sat/inference.sh
Expected behavior
The SAT results are expected to match the quality of the Diffusers results.