OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Details about generating video captions for InternVid #142

Open fmthoker opened 2 weeks ago

fmthoker commented 2 weeks ago

Dear authors, can you share some details about how we can generate captions for new videos in the same manner as was done for InternVid? From the paper, you generated a single caption for the middle frame using BLIP-2 and frame-by-frame captions using the Tag2Text model at a low fps. Can you share the fps used for the Tag2Text part and how many frames were used for each video? Is the number of frames fixed, or does it vary with video length? Any other details would be helpful. Finally, how did you summarize all the captions with the T5 summarization model? Were any specific prompts used?

yinanhe commented 2 weeks ago

Thank you for your interest in our work. For the InternVid dataset, we used the Tag2Text model to caption frames sampled at a rate of 1 frame per second, producing image-level captions. However, since the descriptions generated by Tag2Text were somewhat repetitive, we integrated BLIP-2 to enrich the captions, and we additionally included descriptions of intermediate frames in the overall narrative. As for summarizing with the T5 summarization model, its prior training on summarization tasks eliminated the need for elaborate prompt crafting.
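
For illustration only (this is not the exact script used to build InternVid), a minimal sketch of the pipeline looks like the following. It assumes OpenCV for 1 fps frame sampling, BLIP-2 via Hugging Face `transformers` standing in for both the per-frame captioner (Tag2Text in the paper) and the middle-frame captioner, and a generic `t5-base` summarization checkpoint standing in for the fine-tuned summarizer we used:

```python
# Illustrative sketch only, not the production InternVid pipeline.
# Assumptions: OpenCV for decoding, BLIP-2 from Hugging Face transformers as a
# stand-in for the per-frame captioner (Tag2Text in the paper), and a generic
# "t5-base" summarization checkpoint as a stand-in for the real summarizer.
import cv2
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration, pipeline


def sample_frames_1fps(video_path):
    """Decode roughly one frame per second from the video."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames


processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
summarizer = pipeline("summarization", model="t5-base")  # stand-in checkpoint


def caption_image(image):
    inputs = processor(images=image, return_tensors="pt")
    output_ids = captioner.generate(**inputs, max_new_tokens=40)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()


def describe_video(video_path):
    frames = sample_frames_1fps(video_path)
    # Per-frame captions (Tag2Text in the paper; BLIP-2 used here as a stand-in).
    per_frame = [caption_image(f) for f in frames]
    # One richer caption for the middle frame (BLIP-2, as in the paper).
    middle = caption_image(frames[len(frames) // 2])
    # Fuse everything into a single caption with a T5 summarizer; no special
    # prompt is needed since the model is already trained for summarization.
    joined = " ".join(per_frame + [middle])
    result = summarizer(joined, max_length=60, min_length=10, truncation=True)
    return result[0]["summary_text"]
```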

Additionally, please allow me to introduce VideoChat2-HD, our more accurate and detailed multimodal video model. All you need to do is feed the video into the model with a simple prompt such as "Describe the video in detail." The model will then generate descriptions that are richer and more precise than those in InternVid.
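
In pseudo-code, the interaction pattern is as simple as the sketch below; the module and method names are hypothetical placeholders, not the real VideoChat2 interfaces, which live in the VideoChat2 demo scripts:

```python
# Hypothetical pseudo-API: every name below is an illustrative placeholder,
# not the actual VideoChat2 code (see the VideoChat2 demo scripts for that).
from videochat2 import VideoChat2HD  # hypothetical wrapper module

model = VideoChat2HD.from_pretrained("OpenGVLab/VideoChat2-HD")  # illustrative id
caption = model.chat(video="example.mp4", prompt="Describe the video in detail.")
print(caption)
```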