OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

What do DIV and FLT stand for? #91

Open vedantroy opened 4 months ago

vedantroy commented 4 months ago

I see there are 3 subsets: DIV, FLT, and the aesthetic version. What are the filtering criteria used for DIV and FLT, and what do they stand for?

shepnerd commented 4 months ago

DIV and FLT stand for diverse sampling and filtering, respectively.

For DIV (diversity sampling), we aim to sample video clips from all long videos available to maximize data diversity. This was done by counting the frequencies of long videos in the segmented clip pool and sampling clips with probabilities inverse to these frequencies.

For FLT (filtering), we applied a series of filtering strategies to the video data alongside DIV sampling:

a) Removing video clips shorter than 1s (approximately 23.15% of the total) or longer than 120s (around 0.84% of the total).

b) Computing a CLIPScore for each video clip using a randomly sampled frame from the clip with OpenAI's CLIP-ViT-L/14, then selecting clips within the top 30% of CLIPScores.

c) Sampling 10M of the remaining clips using DIV sampling.

For details, you can refer to Sec. E.1 of the appendix of the paper.
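For readers who want the mechanics spelled out, here is a minimal Python sketch of the two strategies as described above. The record fields (`video_id`, `duration_s`, `clip_score`), the function names, and the toy data are hypothetical stand-ins, not the authors' actual pipeline code; CLIPScores are assumed to be precomputed upstream.

```python
import numpy as np
from collections import Counter

# Hypothetical clip records; field names are illustrative, not the real schema.
clips = [
    {"clip_id": "vidA_000", "video_id": "vidA", "duration_s": 8.0,  "clip_score": 0.31},
    {"clip_id": "vidA_001", "video_id": "vidA", "duration_s": 0.5,  "clip_score": 0.28},
    {"clip_id": "vidA_002", "video_id": "vidA", "duration_s": 15.0, "clip_score": 0.22},
    {"clip_id": "vidB_000", "video_id": "vidB", "duration_s": 30.0, "clip_score": 0.35},
]

def div_sample(clips, k, seed=0):
    """DIV: sample clips with probability inverse to how many clips their
    source (long) video contributed to the pool, so that long videos do not
    dominate the sample."""
    freq = Counter(c["video_id"] for c in clips)
    weights = np.array([1.0 / freq[c["video_id"]] for c in clips])
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(clips), size=k, replace=False, p=probs)
    return [clips[i] for i in idx]

def flt_filter(clips, top_frac=0.30):
    """FLT: drop clips shorter than 1 s or longer than 120 s, then keep only
    the top `top_frac` of the survivors by CLIPScore (computed elsewhere from
    one randomly sampled frame with CLIP-ViT-L/14)."""
    kept = [c for c in clips if 1.0 <= c["duration_s"] <= 120.0]
    kept.sort(key=lambda c: c["clip_score"], reverse=True)
    return kept[: max(1, int(top_frac * len(kept)))]

# FLT pipeline: filter first, then DIV-sample from what remains.
sampled = div_sample(flt_filter(clips), k=1)
```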

vedantroy commented 4 months ago

Got it, and thanks for the fast response! 4 follow-ups (the first one is the most important):

  1. Have you released the JSONL for the full set of 234M clips?

> After the filtering, we get a total of 234M video clips whose durations range from 2s to more than 30s.

  2. Does the aesthetic dataset do any sort of filtering by CLIP score? (I'm guessing not, but wanted to confirm.) Also, how did you determine what counts as a high aesthetic score? (Top 10%? Above some constant? Etc.)

  3. Is this passage:

> we aim to sample video clips from all long videos available to maximize data diversity. This was done by counting the frequencies of long videos in the segmented clip pool and sampling clips with probabilities inverse to these frequencies

saying "if there are many clips from the same video, we sample those clips less" (presumably to avoid oversampling longer videos)?

  4. Is there a reason you computed CLIPScore with CLIP-ViT-L/14 instead of using the UMT_Score when calculating video-caption similarity?