OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0
1.43k stars, 88 forks

Questions about the composition of the dataset #202

Open jiyi-zyh opened 1 month ago

jiyi-zyh commented 1 month ago

Very meaningful work! I have a question about the dataset. I downloaded InternVid-10M-FLT from OpenDataLab (https://opendatalab.org.cn/vd-foundation/InternVid-10M-FLT) and want to extract features from it. In the paper "VTimeLLM: Empower LLM to Grasp Video Moments," the VTimeLLM model uses the InternVid-10M-FLT dataset in its second stage of training. However, in VTimeLLM's training set file stage2.json, each video ID corresponds to more than one video in the InternVid-10M-FLT dataset that I downloaded. For example (see the images below), the original video with ID 3n3oCNerzV0 appears as three segmented videos in my download, and those segments are in InternVId-FLT_1.

This has left me unsure how to extract features. Did I make a mistake in my procedure, or have I misunderstood the InternVid-10M-FLT dataset? Thank you very much for your help!

[Image: 微信图片_20241021234116] [Image: 微信图片_20241021234127]
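To make the one-to-many mapping concrete, here is a minimal sketch that groups downloaded clip files by their source video ID. It assumes, hypothetically, that each clip filename begins with the 11-character YouTube ID (such as 3n3oCNerzV0) followed by some suffix; the actual naming scheme in InternVId-FLT_1 may differ, and the example filenames and the function name `group_clips_by_video_id` are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path

# YouTube video IDs are 11 characters long and may themselves contain
# "_" or "-", so the ID is taken by position rather than by splitting.
YOUTUBE_ID_LEN = 11


def group_clips_by_video_id(clip_names):
    """Map each video ID to the sorted list of clip files derived from it.

    Assumption: every clip filename starts with the source video's ID.
    Names shorter than an ID are skipped.
    """
    groups = defaultdict(list)
    for name in clip_names:
        stem = Path(name).stem  # drop the final extension, e.g. ".mp4"
        if len(stem) < YOUTUBE_ID_LEN:
            continue
        groups[stem[:YOUTUBE_ID_LEN]].append(name)
    return {video_id: sorted(names) for video_id, names in groups.items()}


if __name__ == "__main__":
    # Hypothetical filenames, not taken from the real dataset.
    example = [
        "3n3oCNerzV0_seg2.mp4",
        "3n3oCNerzV0_seg0.mp4",
        "3n3oCNerzV0_seg1.mp4",
        "abc_DEF-123_seg0.mp4",
    ]
    for video_id, clips in group_clips_by_video_id(example).items():
        print(video_id, len(clips), clips)
```

With a grouping like this, the number of clips per ID in the download can be compared against the entries for the same ID in stage2.json.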