PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License
549 stars 44 forks source link

confusion about VIDAL-10M video-text data #44

Open wli333 opened 2 months ago

wli333 commented 2 months ago

Thanks for your effort pushing MLLM into the next stage. Recently, I want to follow your work, and download VIDAL-10M video-text data id2title_folder_raw_ofa_mplug_gpt_sound10076613.json.

I found it contain around 10M video-text, I have following question wish you could give me some hints.

  1. what's the difference between this 10M video-text and 3M video-text mentioned in your ICLR paper.
  2. Regarding to this 10M video-text, I found many video's raw(including title and hashtag) contains some words like youtube, shorts. Take youtube ID LbxMRY4_W10 for example, its raw is I kicked this ball higher than Ja Morant can jump! #shorts #youtubeshorts #youtube #shortclips. But in your paper, you mention "we removed irrelevant words and hashtags, such as ”youtube”, ”fyp”, ”shorts”, etc".

Thanks in advance.