bpiyush / TestOfTime

Official code for our CVPR 2023 paper: Test of Time: Instilling Video-Language Models with a Sense of Time
MIT License

Missing Datasets, Dataset Info #1

Closed mgwillia closed 1 year ago

mgwillia commented 1 year ago

I can’t find code for any datasets other than synthetic and TEMPO. Additionally, when I download the data from your page, Charades is missing splits, annotations, etc. (everything other than feats) and ActivityNet appears to be missing entirely, as well as CharadesEgo. Could you please fill in those gaps for me, or provide more details about your splits?

bpiyush commented 1 year ago

Hey @mgwillia! Thanks for your interest.

Code: We only cleaned and released code for TEMPO and Synthetic as examples, but the structure should be very similar for Charades and the other dataset objects.
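For concreteness, here is a rough skeleton of what a Charades dataset object could look like if you mirror the TEMPO/Synthetic pattern. The class name, file layout, and annotation fields below are only for illustration, not the exact repo code:

```python
# Hypothetical sketch of a Charades clip-pair dataset mirroring the released
# TEMPO/Synthetic objects; names and file layout are illustrative only.
import json
import numpy as np
import torch
from torch.utils.data import Dataset


class CharadesPairs(Dataset):
    """Loads pre-extracted S3D features and stitched clip-pair annotations."""

    def __init__(self, feat_dir, split_file):
        self.feat_dir = feat_dir
        with open(split_file) as f:
            # each entry: video id, two non-overlapping clip spans, stitched caption
            self.samples = json.load(f)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        # (T, D) array of per-clip features for the whole video
        feats = np.load(f"{self.feat_dir}/{s['video_id']}.npy")
        return {
            "video": torch.from_numpy(feats).float(),
            "caption": s["caption"],  # e.g. "X before Y" or "Y after X"
        }
```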

Splits: I realise that Charades splits were absent from my initial data release. Apologies for that. I have updated the same link (charades.zip); it now has a feat/ folder and a splits_public/ folder containing train/val/test splits with artificially stitched pairs of non-overlapping clips. I hope this helps.

As for the other datasets, you can likewise find the splits here: charadesego.zip and activitynet.zip. For ActivityNet, the features are too large for us to host, so we do not provide them here. You can refer to this code for S3D feature extraction.

Note on split creation: In general, we follow these steps to create splits:

- Take a dense video captioning dataset, stitch any two non-overlapping clips within a video, and stitch the corresponding captions with before/after relations.
- Retain the train split from the original dataset and create pairs within it.
- From the evaluation split of the original dataset, make two sub-splits: one for validation and one for testing.

Let me know if you have any further questions.
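As a rough illustration of the pairing step (assuming each annotation is a (start, end, caption) tuple for one video; the function name and the exact caption template below are just for illustration):

```python
import itertools
import random


def make_before_after_pairs(clips, seed=0):
    """Stitch non-overlapping clip pairs from one video's dense captions.

    clips: list of (start, end, caption) tuples for a single video.
    Returns a list of dicts with the two spans and a stitched caption.
    """
    rng = random.Random(seed)
    pairs = []
    for (s1, e1, c1), (s2, e2, c2) in itertools.combinations(clips, 2):
        # keep only temporally non-overlapping clips
        if e1 <= s2 or e2 <= s1:
            first, second = ((s1, e1, c1), (s2, e2, c2)) if s1 < s2 else ((s2, e2, c2), (s1, e1, c1))
            # phrase the stitched caption with a before or after relation
            if rng.random() < 0.5:
                caption = f"{first[2]} before {second[2]}"
            else:
                caption = f"{second[2]} after {first[2]}"
            pairs.append({"clip_1": first[:2], "clip_2": second[:2], "caption": caption})
    return pairs
```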

mgwillia commented 1 year ago

Some follow-up: I'm just getting around to ActivityNet now, and I've realized that the feature extraction code outputs 1024-dim S3D-G features, whereas for TEMPO, for example, 512-dim features were provided. Should I change some setting in the feature extraction to get the smaller features? Or is there some configuration in VideoCLIP that I should change?

bpiyush commented 1 year ago

Hi @mgwillia ,

I realise that VideoCLIP made minor changes to how the S3D features are computed. I followed their code (steps here), which is in turn adapted from the link I shared in the previous message. Apologies for the confusion.

In short, you need to clone and set up the fairseq repo from FAIR. Then, navigate to the folder fairseq/examples/MMPT/ and run

python scripts/video_feature_extractor/extract.py \
    --vdir <path_to_video_folder> \
    --fdir data/feat/feat_how2_s3d \
    --type=s3d --num_decoding_thread=4 \
    --batch_size 32 --half_precision 1
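Once extraction finishes, you can sanity-check the feature dimensionality: you should see 512-dim features (the VideoCLIP-style S3D features) rather than the 1024-dim features from the earlier extractor. The snippet below assumes the extractor writes one .npy array of shape (num_clips, dim) per video into the --fdir above:

```python
import glob
import numpy as np

# Quick check of extracted feature shapes; expect a 512-dim last axis.
for path in sorted(glob.glob("data/feat/feat_how2_s3d/*.npy"))[:5]:
    feats = np.load(path)
    print(path, feats.shape)  # e.g. (num_clips, 512)
```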
bpiyush commented 1 year ago

Closing this issue for now. If there is a follow-up, please feel free to re-open.