THUDM / CogVideo

Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0
7.52k stars · 696 forks

How to do batch inference on CogVLM2-Caption? #327

Open Yuancheng-Xu opened 4 hours ago

Yuancheng-Xu commented 4 hours ago

I have a dataset of videos of different lengths, so the number of frames fed into CogVLM2-Caption differs per video.

How can I do batch inference with CogVLM2-Caption (several videos at a time), especially given that each video may require a different number of frames? Is there any reference code for this? Thanks a lot!

glide-the commented 3 hours ago

@Yuancheng-Xu Hi. Using a model for scene judgment, I extracted the optimal clustered shot descriptions from the original 22-frame videos. Along with this, I also obtained descriptions of the first and last frames for each chunk of the video within each scene, which produced this dataset. A detailed description of the dataset-processing method is available in our Feishu documentation, which is publicly accessible.

It covers the entire dataset-preprocessing pipeline and the ablation-experiment information for LoRA fine-tuning, but it is in Chinese; sorry, I haven't had time to translate it into English yet.
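The chunking described above (scene-based splits, keeping the first and last frame of each chunk) can be sketched roughly as follows. This is a toy illustration, not the actual pipeline: `diff_fn` and `threshold` stand in for whatever scene-judgment model is really used.

```python
def split_into_scenes(frames, diff_fn, threshold):
    """Split a frame sequence into scene chunks.

    `diff_fn(a, b)` is a placeholder scene-change score between two
    consecutive frames; a new chunk starts whenever the score exceeds
    `threshold`. Returns (chunk, (first_frame, last_frame)) pairs, so the
    first/last frames of each chunk are available for captioning.
    """
    chunks, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if diff_fn(prev, cur) > threshold:
            chunks.append(current)
            current = []
        current.append(cur)
    chunks.append(current)
    return [(chunk, (chunk[0], chunk[-1])) for chunk in chunks]

# Toy example: frames are integers, and a "scene change" is a jump > 5.
frames = [0, 1, 2, 10, 11, 20]
scenes = split_into_scenes(frames, lambda a, b: abs(a - b), threshold=5)
for chunk, (first, last) in scenes:
    print(chunk, first, last)
```

On this toy input the sequence splits into three chunks, `[0, 1, 2]`, `[10, 11]`, and `[20]`; in the real pipeline each chunk (and its first/last frames) would then be captioned.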

script: https://zhipu-ai.feishu.cn/wiki/Ln9dw9ohpiFymekjeabc8TTinRd

Ablation experiment: https://zhipu-ai.feishu.cn/wiki/OjIDwMEKniIby1kHQa4cMKibnhP
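For the original batching question, one common workaround (not specific to CogVLM2-Caption) is to bucket videos by frame count so that every batch is uniform and can be stacked into a single tensor. A minimal sketch, with a placeholder comment where the real captioning call would go:

```python
from collections import defaultdict

def bucket_by_frame_count(videos):
    """Group (video_id, frames) pairs so each bucket has a uniform length.

    Videos with the same number of frames land in the same bucket and can
    be stacked and run through the model together.
    """
    buckets = defaultdict(list)
    for video_id, frames in videos:
        buckets[len(frames)].append((video_id, frames))
    return buckets

def batched(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Dummy data: integers stand in for frame tensors.
videos = [
    ("a", [0] * 8),
    ("b", [0] * 16),
    ("c", [0] * 8),
    ("d", [0] * 16),
    ("e", [0] * 8),
]

for n_frames, group in sorted(bucket_by_frame_count(videos).items()):
    for batch in batched(group, batch_size=2):
        ids = [vid for vid, _ in batch]
        # Each batch now has a uniform frame count, so the frames could be
        # stacked into one tensor and passed to the captioning model in a
        # single forward call (hypothetical):
        #   captions = model.generate(stack([f for _, f in batch]))
        print(n_frames, ids)
```

Within each bucket this is ordinary fixed-shape batching; the trade-off is that rare frame counts produce small batches, which padding plus an attention mask would avoid at the cost of wasted compute on pad frames.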