We provide a faster version for extracting objects from WebVid 2.5M and CC 3M. In total, we extract objects from 5.5M * 8 = 44M frames, which takes 28 days on 16 V100 GPUs.
Refer to Object Extractor.md for more details.
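
As a rough sanity check on the extraction cost quoted above, the figures can be turned into a per-GPU throughput estimate. This is only a back-of-the-envelope sketch: it assumes the 5.5M samples are the sum of WebVid 2.5M and CC 3M, that 28 days is wall-clock time, and that all 16 GPUs are busy for the whole run. The variable names are illustrative and do not come from the repository.

```python
# Back-of-the-envelope throughput from the figures quoted in this README.
# Assumption: 28 days is wall-clock time with all 16 GPUs fully utilized.
num_samples = 5.5e6          # WebVid 2.5M + CC 3M
frames_per_sample = 8        # frames processed per sample
total_frames = num_samples * frames_per_sample

num_gpus = 16
days = 28
gpu_days = num_gpus * days
gpu_seconds = gpu_days * 24 * 3600

frames_per_gpu_second = total_frames / gpu_seconds

print(f"total frames: {total_frames / 1e6:.0f}M")
print(f"GPU-days: {gpu_days}")
print(f"frames per GPU-second: {frames_per_gpu_second:.2f}")
```

Under these assumptions the run costs 448 GPU-days, or a little over one frame per second per GPU, which can help when budgeting extraction on a different dataset size or GPU count.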
Refer to train.md for more details.
This repository provides two ways to visualize cross-modality attention.
Please refer to visualization.md for details.
If you find our work helpful, please cite our paper:
@article{wang2022oatrans,
title={Object-aware Video-language Pre-training for Retrieval},
author={Wang, Alex Jinpeng and Ge, Yixiao and Cai, Guanyu and Yan, Rui and Lin, Xudong and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
journal={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2022}
}
This work is mainly based on Frozen.