Yuancheng-Xu opened 4 hours ago
@Yuancheng-Xu Hi, I used a model to segment the original 22-frame video into scenes and extracted the optimal clustered shot descriptions. Along with this, I also obtained descriptions of the first and last frames of each video chunk within the scenes, resulting in this dataset. A detailed write-up of the dataset processing method is available in our Feishu documentation, which is publicly accessible.
It covers the entire dataset preprocessing pipeline and the ablation experiments for LoRA fine-tuning, but it is in Chinese; sorry, I haven't had time to translate it into English yet.
script: https://zhipu-ai.feishu.cn/wiki/Ln9dw9ohpiFymekjeabc8TTinRd
Ablation experiment: https://zhipu-ai.feishu.cn/wiki/OjIDwMEKniIby1kHQa4cMKibnhP
I have a dataset of videos with varying lengths, so the number of frames fed into CogVLM2-Caption differs per video.
How can I do batch inference with CogVLM2-Caption (several videos at a time), especially given that each video may require a different number of frames? Is there any reference code for this? Thank you!
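Not an official answer, but one common workaround when inputs have different frame counts is to bucket videos by frame count so every batch has a uniform shape and no padding is needed; you then run one forward pass per bucket batch. A minimal sketch (the helper names and the `(path, num_frames)` input format are my own assumptions, not part of the CogVLM2-Caption API):

```python
from collections import defaultdict

def bucket_by_frame_count(videos):
    """Group videos by their frame count.

    `videos` is a list of (path, num_frames) tuples; putting videos with
    the same frame count together lets each batch stack into one tensor
    without any padding.
    """
    buckets = defaultdict(list)
    for path, num_frames in videos:
        buckets[num_frames].append(path)
    return dict(buckets)

def make_batches(videos, batch_size=4):
    """Yield (num_frames, paths) batches, each uniform in frame count."""
    for num_frames, paths in bucket_by_frame_count(videos).items():
        for i in range(0, len(paths), batch_size):
            yield num_frames, paths[i:i + batch_size]

# Hypothetical usage: within each batch, the decoded frame tensors all
# share the same shape, so they can be stacked and passed to the model
# in a single generate() call.
videos = [("a.mp4", 24), ("b.mp4", 24), ("c.mp4", 48)]
for num_frames, paths in make_batches(videos, batch_size=2):
    print(num_frames, paths)
```

The alternative is to pad every video to the maximum frame count in the batch, but that only works if the model's vision encoder accepts an attention mask over frames, which I haven't verified for CogVLM2-Caption.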