DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Apache License 2.0
752 stars 50 forks

Webvid-10M (40% sampling) #56

Open VJatla opened 2 months ago

VJatla commented 2 months ago

Hello,

After going through the paper, I understand that 40% of the video-text pairs from the WebVid-10M dataset are used. Could you please share the rationale behind this choice, or point me to something that explains how this 40% subset of videos was selected?
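For reference, the simplest selection strategy would be uniform random subsampling of the pair list. This is only a hypothetical sketch of such a baseline, not the method actually used by VideoLLaMA 2 (the question above is precisely about what the real criterion was); the `sample_pairs` helper and its arguments are illustrative assumptions.

```python
import random

def sample_pairs(pairs, fraction=0.4, seed=0):
    """Uniformly sample a fraction of video-text pairs.

    Hypothetical baseline: a fixed seed makes the subset reproducible.
    """
    rng = random.Random(seed)
    k = int(len(pairs) * fraction)
    return rng.sample(pairs, k)

# Toy example with 10 placeholder pairs; 40% -> 4 pairs kept.
pairs = [(f"video_{i}.mp4", f"caption {i}") for i in range(10)]
subset = sample_pairs(pairs, fraction=0.4)
print(len(subset))  # 4
```

In practice, papers often filter by heuristics instead (caption length, CLIP similarity, resolution, duration), which is why the selection criterion matters for reproduction.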