X-PLUG / mPLUG-2

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (ICML 2023)
Apache License 2.0

train/test split JSON files for MSR-VTT caption task reproduction #25

Open naajeehxe opened 2 months ago

naajeehxe commented 2 months ago

Thank you for your wonderful project!

Could you provide the train/test split JSON files for the MSR-VTT caption dataset? I am unable to access the following files:

•   datasets/annotations_all/msrvtt_caption/train.jsonl
•   datasets/annotations_all/msrvtt_caption/test.jsonl
naajeehxe commented 2 months ago

From my understanding, you used 1k samples for the test set. To accurately reproduce the results from the paper, could you please provide the sample IDs used for the test set?

idj3tboy commented 2 months ago

Yes, me too. While trying to reproduce the results, I couldn't find the files mentioned by @naajeehxe, plus the following file: 'datasets/annotations_all/msvd_caption/train.jsonl'. It would be great if you could let us know how to generate them.

naajeehxe commented 2 months ago

@idj3tboy I’m not sure if this will be helpful, but I’d like to share how I did it. I downloaded the data from https://cove.thecvf.com/datasets/839 and used the following two txt files for the train/test split:

•   MSRVTT/videos/train_list_new.txt
•   MSRVTT/videos/test_list_new.txt

As a result, I got 7,010 training samples and 2,990 test samples. I’m not exactly sure what the 9k/1k train/test split mentioned in the paper refers to, but I was able to reproduce results similar to the paper’s using this 7k/3k train/test split.
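In case it helps, here is a minimal sketch of how I scripted the split. A few assumptions to flag: the annotation file name (`videodatainfo.json`) and its `sentences` schema follow the standard MSRVTT release, and the output field names (`video`, `caption`) are my guesses at what the repo's dataloader expects, since the exact JSONL schema isn't documented in this thread; adjust them to match the other files under `annotations_all`.

```python
import json

# Assumed paths from the COVE MSRVTT download; adjust to your layout.
ANNOTATIONS = "MSRVTT/annotation/videodatainfo.json"  # standard MSRVTT annotation file
TRAIN_LIST = "MSRVTT/videos/train_list_new.txt"       # 7,010 video ids
TEST_LIST = "MSRVTT/videos/test_list_new.txt"         # 2,990 video ids


def load_split_ids(path):
    """Read one video id per line (e.g. 'video7010'); tolerate a '.mp4' suffix."""
    with open(path) as f:
        return {line.strip().removesuffix(".mp4") for line in f if line.strip()}


def write_jsonl(path, ids, sentences):
    """Write one {'video': ..., 'caption': ...} record per caption (field names assumed)."""
    with open(path, "w") as f:
        for s in sentences:
            if s["video_id"] in ids:
                record = {"video": s["video_id"] + ".mp4", "caption": s["caption"]}
                f.write(json.dumps(record) + "\n")


def main():
    with open(ANNOTATIONS) as f:
        sentences = json.load(f)["sentences"]  # [{'video_id': ..., 'caption': ...}, ...]
    write_jsonl("datasets/annotations_all/msrvtt_caption/train.jsonl",
                load_split_ids(TRAIN_LIST), sentences)
    write_jsonl("datasets/annotations_all/msrvtt_caption/test.jsonl",
                load_split_ids(TEST_LIST), sentences)


if __name__ == "__main__":
    main()
```

One caveat: MSRVTT has 20 captions per video, so for the test set you may want one record per video rather than one per caption, depending on how the evaluation script collects references.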

If you’re in a hurry, it might be a good idea to give it a try!