Source code of our TPAMI'21 paper Dual Encoding for Video Retrieval by Text and CVPR'19 paper Dual Encoding for Zero-Example Video Retrieval.
We used Anaconda to set up a deep learning workspace that supports PyTorch. Run the following script to install the required packages.
conda create --name ws_dual_py3 python=3.8
conda activate ws_dual_py3
git clone https://github.com/danieljf24/hybrid_space.git
cd hybrid_space
pip install -r requirements.txt
conda deactivate
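Optionally, you can sanity-check the environment before moving on (activate it again with conda activate ws_dual_py3 first). A minimal check that PyTorch was installed and can see the GPU:

```python
# Quick sanity check, run inside the ws_dual_py3 environment.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # expect True if a GPU and driver are set up
```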
Run the following script to download and extract the MSR-VTT dataset (msrvtt10k-resnext101_resnet152.tar.gz, 4.3G) and a pre-trained word2vec (vec500flickr30m.tar.gz, 3.0G). The data can also be downloaded from Baidu pan (url, password: p3p0) or Google drive (url). For more information about the dataset, please refer to here.
The extracted data is placed in $HOME/VisualSearch/.
ROOTPATH=$HOME/VisualSearch
mkdir -p $ROOTPATH && cd $ROOTPATH
# download and extract dataset
wget http://8.210.46.84:8787/tpami2021/msrvtt10k-resnext101_resnet152.tar.gz
tar zxf msrvtt10k-resnext101_resnet152.tar.gz -C $ROOTPATH
# download and extract pre-trained word2vec
wget http://lixirong.net/data/w2vv-tmm2018/word2vec.tar.gz
tar zxf word2vec.tar.gz -C $ROOTPATH
Run the following script to train and evaluate the Dual Encoding network with hybrid space on the official partition of MSR-VTT. The video features are the concatenation of ResNeXt-101 and ResNet-152 features. The video feature extraction code we used in the paper is available here.
conda activate ws_dual_py3
./do_all.sh msrvtt10k hybrid resnext101-resnet152
Running the script will do the following things:

1. Generate a vocabulary and concepts on the training set: ./do_vocab_concept.sh msrvtt10k 1 $ROOTPATH
2. Train the Dual Encoding network with hybrid space and select the checkpoint that performs best on the validation set as the final model. Notice that only the best-performing checkpoint on the validation set is saved, to save disk space.

If you would like to train the Dual Encoding network with latent space learning (the conference version), please run the following script:
./do_all.sh msrvtt10k latent resnext101-resnet152 $ROOTPATH
To train the model on the Test1k-Miech and Test1k-Yu partitions of MSR-VTT, please run the following scripts:
./do_all.sh msrvtt10kmiech hybrid resnext101-resnet152 $ROOTPATH
./do_all.sh msrvtt10kyu hybrid resnext101-resnet152 $ROOTPATH
The overview of pre-trained checkpoints on MSR-VTT is as follows.

| Split | Pre-trained Checkpoints |
|---|---|
| Official | msrvtt10k_model_best.pth.tar (264M) |
| Test1k-Miech | msrvtt10kmiech_model_best.pth.tar (267M) |
| Test1k-Yu | msrvtt10kyu_model_best.pth.tar (267M) |
Note that if you would like to evaluate using our trained checkpoints, please make sure to use the vocabulary and concept annotations provided in msrvtt10k-resnext101_resnet152.tar.gz.
Run the following script to download and evaluate our trained checkpoints on the official split of MSR-VTT. The trained checkpoints can also be downloaded from Baidu pan (url, password:p3p0).
MODELDIR=$HOME/VisualSearch/checkpoints
mkdir -p $MODELDIR
# download trained checkpoints
wget -P $MODELDIR http://8.210.46.84:8787/tpami2021/checkpoints/msrvtt10k_model_best.pth.tar
# evaluate on the official split of MSR-VTT
CUDA_VISIBLE_DEVICES=0 python tester.py --testCollection msrvtt10k --logger_name $MODELDIR --checkpoint_name msrvtt10k_model_best.pth.tar
To evaluate on the Test1k-Miech and Test1k-Yu splits, please run the following scripts.
MODELDIR=$HOME/VisualSearch/checkpoints
# download trained checkpoints on Test1k-Miech
wget -P $MODELDIR http://8.210.46.84:8787/tpami2021/checkpoints/msrvtt10kmiech_model_best.pth.tar
# evaluate on Test1k-Miech of MSR-VTT
CUDA_VISIBLE_DEVICES=0 python tester.py --testCollection msrvtt10kmiech --logger_name $MODELDIR --checkpoint_name msrvtt10kmiech_model_best.pth.tar
MODELDIR=$HOME/VisualSearch/checkpoints
# download trained checkpoints on Test1k-Yu
wget -P $MODELDIR http://8.210.46.84:8787/tpami2021/checkpoints/msrvtt10kyu_model_best.pth.tar
# evaluate on Test1k-Yu of MSR-VTT
CUDA_VISIBLE_DEVICES=0 python tester.py --testCollection msrvtt10kyu --logger_name $MODELDIR --checkpoint_name msrvtt10kyu_model_best.pth.tar
The expected performance of Dual Encoding on MSR-VTT is as follows. Notice that due to random factors in SGD based training, the numbers differ slightly from those reported in the paper.
| Split | T2V R@1 | T2V R@5 | T2V R@10 | T2V MedR | T2V mAP | V2T R@1 | V2T R@5 | V2T R@10 | V2T MedR | V2T mAP | SumR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Official | 11.8 | 30.6 | 41.8 | 17 | 21.4 | 21.6 | 45.9 | 58.5 | 7 | 10.3 | 210.2 |
| Test1k-Miech | 22.7 | 50.2 | 63.1 | 5 | 35.6 | 24.7 | 52.3 | 64.2 | 5 | 37.2 | 277.2 |
| Test1k-Yu | 21.5 | 48.8 | 60.2 | 6 | 34.0 | 21.7 | 49.0 | 61.4 | 6 | 34.6 | 262.6 |

(T2V: Text-to-Video Retrieval; V2T: Video-to-Text Retrieval.)
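SumR is the sum of the six recall scores (R@1/5/10 in both retrieval directions). A quick check against the Official row above:

```python
# SumR for the Official split = sum of the six recall values (T2V and V2T R@1/5/10).
t2v_recalls = [11.8, 30.6, 41.8]
v2t_recalls = [21.6, 45.9, 58.5]
print(round(sum(t2v_recalls) + sum(v2t_recalls), 1))  # 210.2
```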
Download the VATEX dataset (vatex-i3d.tar.gz, 3.0G) and a pre-trained word2vec (vec500flickr30m.tar.gz, 3.0G). The data can also be downloaded from Baidu pan (url, password: p3p0) or Google drive (url). For more information about the dataset, please refer to here. Please extract the data into $HOME/VisualSearch/.
Run the following script to train and evaluate the Dual Encoding network with hybrid space on VATEX.
# download and extract dataset
wget http://8.210.46.84:8787/tpami2021/vatex-i3d.tar.gz
tar zxf vatex-i3d.tar.gz -C $ROOTPATH
./do_all.sh vatex hybrid i3d_kinetics $ROOTPATH
Run the following script to download and evaluate our trained model (vatex_model_best.pth.tar(230M)) on VATEX.
MODELDIR=$HOME/VisualSearch/checkpoints
# download trained checkpoints
wget -P $MODELDIR http://8.210.46.84:8787/tpami2021/checkpoints/vatex_model_best.pth.tar
CUDA_VISIBLE_DEVICES=0 python tester.py --testCollection vatex --logger_name $MODELDIR --checkpoint_name vatex_model_best.pth.tar
The expected performance of Dual Encoding with hybrid space learning on VATEX is as follows.
| Split | T2V R@1 | T2V R@5 | T2V R@10 | T2V MedR | T2V mAP | V2T R@1 | V2T R@5 | V2T R@10 | V2T MedR | V2T mAP | SumR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VATEX | 35.8 | 72.8 | 82.9 | 2 | 52.0 | 47.5 | 76.0 | 85.3 | 2 | 39.1 | 400.3 |
The following datasets are used for training, validation and testing: the joint collection of MSR-VTT and TGIF, tv2016train and IACC.3. For more information about these datasets, please refer to here.
Please download the frame-level features from Baidu pan (url, password: qwlc). The filenames of the feature data are summarized as follows.

| Datasets | 2048-dim ResNeXt-101 | 2048-dim ResNet-152 |
|---|---|---|
| MSR-VTT | msrvtt10k_ResNext-101.tar.gz | msrvtt10k_ResNet-152.tar.gz |
| TGIF | tgif_ResNext-101.tar.gz | tgif_ResNet-152.tar.gz |
| tv2016train | tv2016train_ResNext-101.tar.gz | tv2016train_ResNet-152.tar.gz |
| IACC.3 | iacc.3_ResNext-101.tar.gz | iacc.3_ResNet-152.tar.gz |
Note that if you have already downloaded the MSR-VTT data provided above, you do not need to download msrvtt10k_ResNext-101.tar.gz and msrvtt10k_ResNet-152.tar.gz again.
Please download the above data, and run the following scripts to extract them into $HOME/VisualSearch/.
ROOTPATH=$HOME/VisualSearch
# extract ResNext-101
tar zxf tgif_ResNext-101.tar.gz -C $ROOTPATH
tar zxf msrvtt10k_ResNext-101.tar.gz -C $ROOTPATH
tar zxf tv2016train_ResNext-101.tar.gz -C $ROOTPATH
tar zxf iacc.3_ResNext-101.tar.gz -C $ROOTPATH
# extract ResNet-152
tar zxf tgif_ResNet-152.tar.gz -C $ROOTPATH
tar zxf msrvtt10k_ResNet-152.tar.gz -C $ROOTPATH
tar zxf tv2016train_ResNet-152.tar.gz -C $ROOTPATH
tar zxf iacc.3_ResNet-152.tar.gz -C $ROOTPATH
# combine features of tgif and msrvtt10k
./do_combine_features.sh
ROOTPATH=$HOME/VisualSearch
trainCollection=tgif-msrvtt10k
overwrite=0
# Generate a vocabulary on the training set
./util/do_get_vocab.sh $trainCollection $ROOTPATH $overwrite
# Generate concepts according to video captions
./util/do_get_tags.sh $trainCollection $ROOTPATH $overwrite
# Generate video frame info
visual_feature=resnext101-resnet152
./util/do_get_frameInfo.sh $trainCollection $visual_feature $ROOTPATH $overwrite
# training and testing
./do_all_avs.sh $ROOTPATH
Our code supports two dataset structures:

- Single-folder structure: the train, validation and test subsets are stored in one folder.
- Multiple-folder structure: the train, validation and test subsets are stored in three separate folders.

For the single-folder structure, store the train, validation and test subsets in one folder organized as follows.
${collection}
├── FeatureData
│   └── ${feature_name}
│       ├── feature.bin
│       ├── shape.txt
│       └── id.txt
└── TextData
    ├── ${collection}train.caption.txt
    ├── ${collection}val.caption.txt
    └── ${collection}test.caption.txt
- FeatureData: video frame features. Use txt2bin.py to convert video frame features into the required binary format (see the sketch below).
- ${collection}train.caption.txt: training caption data.
- ${collection}val.caption.txt: validation caption data.
- ${collection}test.caption.txt: test caption data.
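If you would rather prepare FeatureData with your own code instead of txt2bin.py, the sketch below shows one possible writer. The on-disk layout it assumes (raw row-major float32 vectors in feature.bin, the matrix shape "num_items dim" in shape.txt, and the ids in row order in id.txt) is an assumption on our part, so please verify it against txt2bin.py before relying on it:

```python
# Hypothetical FeatureData writer -- the layout below is assumed; verify against txt2bin.py.
import os
import numpy as np

def write_feature_dir(out_dir, ids, features):
    """ids: list of frame ids; features: (N, dim) array with rows aligned to ids."""
    os.makedirs(out_dir, exist_ok=True)
    feats = np.asarray(features, dtype=np.float32)
    feats.tofile(os.path.join(out_dir, "feature.bin"))   # raw float32, row-major
    with open(os.path.join(out_dir, "shape.txt"), "w") as f:
        f.write("%d %d" % feats.shape)                    # "num_items dim"
    with open(os.path.join(out_dir, "id.txt"), "w") as f:
        f.write(" ".join(ids))                            # ids in the same order as the rows
```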
The caption file format is as follows, in which the video and the sentence on the same line are relevant.
video_id_1#1 sentence_1
video_id_1#2 sentence_2
...
video_id_n#1 sentence_k
...
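For reference, here is a minimal parser for this caption format (a hypothetical helper, not part of the repo): each line starts with a caption id of the form video_id#k, followed by the sentence.

```python
# Minimal parser for ${collection}*.caption.txt files (hypothetical helper, not part of the repo).
def read_captions(path):
    """Return a list of (video_id, sentence) pairs."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            caption_id, sentence = line.split(" ", 1)  # e.g. "video_id_1#1", "sentence_1"
            video_id = caption_id.split("#")[0]        # drop the per-video caption index
            pairs.append((video_id, sentence))
    return pairs
```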
Please run the script to generate vocabulary and concepts:
./util/do_vocab_concept.sh $collection 0 $ROOTPATH
Run the following script to train and evaluate Dual Encoding on your own dataset:
./do_all_singlefolder.sh ${collection} hybrid ${feature_name} ${rootpath}
For the multiple-folder structure, store the training, validation and test subsets in three separate folders, each organized as follows.
${subset_name}
├── FeatureData
│   └── ${feature_name}
│       ├── feature.bin
│       ├── shape.txt
│       └── id.txt
└── TextData
    └── ${subset_name}.caption.txt
- FeatureData: video frame features.
- ${subset_name}.caption.txt: caption data of the corresponding subset.

You can run the following script to check whether the data is ready:
./do_format_check.sh ${train_set} ${val_set} ${test_set} ${rootpath} ${feature_name}
where train_set, val_set and test_set indicate the names of the training, validation and test sets, respectively, ${rootpath} denotes the path where the datasets are stored, and feature_name is the video frame feature name.
Please run the script to generate vocabulary and concepts:
./util/do_vocab_concept.sh ${train_set} 0 $ROOTPATH
If you pass the format check, use the following script to train and evaluate Dual Encoding on your own dataset:
./do_all_multifolder.sh ${train_set} ${val_set} ${test_set} hybrid ${feature_name} ${rootpath}
If you find the package useful, please consider citing our TPAMI'21 or CVPR'19 paper:
@article{dong2021dual,
title={Dual Encoding for Video Retrieval by Text},
author={Dong, Jianfeng and Li, Xirong and Xu, Chaoxi and Yang, Xun and Yang, Gang and Wang, Xun and Wang, Meng},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
doi = {10.1109/TPAMI.2021.3059295},
year={2021}
}
@inproceedings{cvpr2019-dual-dong,
title = {Dual Encoding for Zero-Example Video Retrieval},
author = {Jianfeng Dong and Xirong Li and Chaoxi Xu and Shouling Ji and Yuan He and Gang Yang and Xun Wang},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2019},
}