Yuancheng-Xu opened 4 hours ago
@Yuancheng-Xu Hi, I used a model to segment the original 22-frame video into scenes and extracted the optimal clustered shot descriptions. Along with this, I also obtained descriptions of the first and last frames of each video chunk within the scenes, resulting in this dataset. A detailed write-up of the dataset processing method is available in our Feishu documentation, which is publicly accessible.
It covers the entire dataset preprocessing pipeline and the ablation experiments for LoRA fine-tuning, but it is in Chinese; sorry, I haven't had time to translate it into English yet.
script: https://zhipu-ai.feishu.cn/wiki/Ln9dw9ohpiFymekjeabc8TTinRd
Ablation experiment: https://zhipu-ai.feishu.cn/wiki/OjIDwMEKniIby1kHQa4cMKibnhP
I have a dataset of videos with varying lengths, so the number of frames fed into CogVLM2-Caption differs per video.
How can I do batch inference with CogVLM2-Caption (several videos at a time), especially given that each video may require a different number of frames? Is there any reference code for this? Thank you!
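Not an official answer, but one common workaround when inputs have different frame counts is to bucket videos by frame count so every batch has a uniform shape and no padding is needed; you then run one forward pass per bucket batch. A minimal sketch (the helper names and the `(path, num_frames)` input format are my own assumptions, not part of the CogVLM2-Caption API):

```python
from collections import defaultdict

def bucket_by_frame_count(videos):
    """Group videos by their frame count.

    `videos` is a list of (path, num_frames) tuples; putting videos with
    the same frame count together lets each batch stack into one tensor
    without any padding.
    """
    buckets = defaultdict(list)
    for path, num_frames in videos:
        buckets[num_frames].append(path)
    return dict(buckets)

def make_batches(videos, batch_size=4):
    """Yield (num_frames, paths) batches, each uniform in frame count."""
    for num_frames, paths in bucket_by_frame_count(videos).items():
        for i in range(0, len(paths), batch_size):
            yield num_frames, paths[i:i + batch_size]

# Hypothetical usage: within each batch, the decoded frame tensors all
# share the same shape, so they can be stacked and passed to the model
# in a single generate() call.
videos = [("a.mp4", 24), ("b.mp4", 24), ("c.mp4", 48)]
for num_frames, paths in make_batches(videos, batch_size=2):
    print(num_frames, paths)
```

The alternative is to pad every video to the maximum frame count in the batch, but that only works if the model's vision encoder accepts an attention mask over frames, which I haven't verified for CogVLM2-Caption.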