Thanks for your effort pushing MLLM into the next stage. Recently, I want to follow your work, and download VIDAL-10M video-text data id2title_folder_raw_ofa_mplug_gpt_sound10076613.json.
I found it contain around 10M video-text, I have following question wish you could give me some hints.
what's the difference between this 10M video-text and 3M video-text mentioned in your ICLR paper.
Regarding to this 10M video-text, I found many video's raw(including title and hashtag) contains some words like youtube, shorts. Take youtube ID LbxMRY4_W10 for example, its raw is I kicked this ball higher than Ja Morant can jump! #shorts #youtubeshorts #youtube #shortclips. But in your paper, you mention "we removed irrelevant words and hashtags, such as ”youtube”, ”fyp”, ”shorts”, etc".
Thanks for your effort pushing MLLM into the next stage. Recently, I want to follow your work, and download VIDAL-10M video-text data id2title_folder_raw_ofa_mplug_gpt_sound10076613.json.
I found it contain around 10M video-text, I have following question wish you could give me some hints.
Thanks in advance.