Authors: Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal
Paper link: arXiv (arXiv:2309.10091)
Introduction: UCoFiA captures cross-modal similarity information at different granularity levels (video-sentence, frame-sentence, pixel-word) and unifies the multi-level alignments for video-text retrieval. Our model achieves state-of-the-art results on five benchmark datasets: MSR-VTT, MSVD, ActivityNet, DiDeMo, and VATEX.
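To make the idea of multi-level alignment concrete, here is a minimal, simplified sketch in plain Python. It is not the UCoFiA implementation: the pooling choices (mean, max), the plain sum used to combine levels, and all names (`multi_level_similarity`, `frames`, `regions`, `words`) are assumptions made for illustration, and `regions` stands in for the finest visual units of the pixel-word level.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vec(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def multi_level_similarity(frames, regions, sentence, words):
    """Toy coarse-to-fine similarity (hypothetical, for illustration only)."""
    # Coarse level: whole video (mean of frame features) vs. whole sentence.
    video_sentence = cosine(mean_vec(frames), sentence)
    # Middle level: best-matching frame vs. whole sentence.
    frame_sentence = max(cosine(f, sentence) for f in frames)
    # Fine level: each word's best-matching region, averaged over words.
    region_word = sum(max(cosine(r, w) for r in regions) for w in words) / len(words)
    return {
        "video_sentence": video_sentence,
        "frame_sentence": frame_sentence,
        "region_word": region_word,
        # Assumption: the levels are combined by a plain sum.
        "unified": video_sentence + frame_sentence + region_word,
    }

# Tiny 2-D toy features, invented for this example.
scores = multi_level_similarity(
    frames=[[1.0, 0.0], [0.0, 1.0]],
    regions=[[1.0, 0.0], [0.0, 1.0]],
    sentence=[1.0, 0.0],
    words=[[1.0, 0.0], [0.0, 1.0]],
)
print(scores)
```

In this toy input the coarse score is lower than the frame-level score because mean pooling blurs the one frame that matches the sentence. That is the kind of information the finer levels are meant to recover.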
# train code for UCoFiA
./train
# video-to-text retrieval evaluation with SK-norm
./eval_v2t
# text-to-video retrieval evaluation with SK-norm
./eval_t2v
We release the UCoFiA checkpoint trained on the MSR-VTT dataset at https://drive.google.com/file/d/1-zynuNEI1u0TwNxw2R163kX4kocWDsIu/view?usp=sharing
conda create -n ucofia python=3.8
conda activate ucofia
Install PyTorch (torch==1.12.1, torchvision==0.13.1), then install the other dependencies:
pip install -r requirements.txt
Download the pretrained CLIP ViT-B/32 weights into ./modules:
wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
We test our model on MSR-VTT, MSVD, ActivityNet, DiDeMo, and VATEX.
Please refer to CLIP4Clip for downloading the first four datasets and to TS2-Net for downloading the VATEX dataset.
Following CLIP4Clip, we compress all videos to 3 fps at a resolution of 224×224.
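As a rough illustration of this preprocessing step, the sketch below builds an ffmpeg command that resamples a video to 3 fps and resizes it to 224×224. It is not the CLIP4Clip preprocessing script: the helper names (`build_ffmpeg_cmd`, `compress_video`) and the file paths are hypothetical, and the plain square resize is an assumption, so check CLIP4Clip's own script for how it handles aspect ratio.

```python
import subprocess

def build_ffmpeg_cmd(src, dst, fps=3, size=224):
    """Build an ffmpeg command that resamples to `fps` and resizes to size x size."""
    return [
        "ffmpeg",
        "-y",                           # overwrite dst if it already exists
        "-i", src,                      # input video
        "-vf", f"scale={size}:{size}",  # assumption: square resize, ignores aspect ratio
        "-r", str(fps),                 # output frame rate
        "-an",                          # drop the audio track
        dst,
    ]

def compress_video(src, dst, fps=3, size=224):
    """Run ffmpeg; requires the ffmpeg binary on PATH."""
    subprocess.run(build_ffmpeg_cmd(src, dst, fps, size), check=True)

# Hypothetical paths, printed rather than executed.
print(" ".join(build_ffmpeg_cmd("raw/video0.mp4", "compressed/video0.mp4")))
```

Keeping command construction separate from execution makes it easy to inspect the command before running it over a whole dataset.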
We provide example UCoFiA training and evaluation scripts below. Please customize the data paths in the scripts.
cd ./train
sh scripts/train_msrvtt.sh
To use SK-norm at inference time, first change the checkpoint path in line xxx of ./eval_t2v/main_ucofia.py to the best checkpoint saved during training (you can evaluate multiple checkpoints for better results).
cd ./eval_t2v
sh scripts/eval_msrvtt.sh
To use SK-norm at inference time for video-to-text retrieval, first change the checkpoint path in line xxx of ./eval_v2t/main_ucofia.py to the best checkpoint saved during training (you can evaluate multiple checkpoints for better results).
cd ./eval_v2t
sh scripts/eval_msrvtt.sh
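This README does not define SK-norm. Assuming it refers to Sinkhorn-Knopp normalization of a query-by-candidate similarity matrix, the following plain-Python sketch shows the general technique: exponentiate the scores, then alternately rescale rows and columns until both sum to one. It is a generic illustration, not the code in the evaluation directories, and the function name, `n_iters`, and `temperature` are hypothetical.

```python
import math

def sinkhorn_knopp(sim, n_iters=50, temperature=1.0):
    """Alternately normalize rows and columns of exp(sim / temperature).

    `sim` is a square list-of-lists of similarity scores. The result is
    approximately doubly stochastic: every row and column sums to about 1.
    """
    k = [[math.exp(v / temperature) for v in row] for row in sim]
    n_rows, n_cols = len(k), len(k[0])
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        for i in range(n_rows):
            row_sum = sum(k[i])
            k[i] = [v / row_sum for v in k[i]]
        # Column step: make every column sum to 1.
        for j in range(n_cols):
            col_sum = sum(k[i][j] for i in range(n_rows))
            for i in range(n_rows):
                k[i][j] /= col_sum
    return k

# Toy 3x3 text-to-video similarity matrix, invented for this example.
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.2, 0.4, 0.7],
]
for row in sinkhorn_knopp(toy_sim):
    print([round(v, 3) for v in row])
```

Because every candidate's column is forced to carry the same total mass, a candidate that scores highly against many queries is damped, while a candidate that is rarely retrieved is boosted.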
We thank the developers of X-CLIP, TS2-Net, CLIP4Clip, and CLIP for their public code releases. We also thank the authors of NCL for helpful discussions.
Please cite our paper if you use our models in your work:
@article{wang2023unified,
title={Unified Coarse-to-Fine Alignment for Video-Text Retrieval},
author={Wang, Ziyang and Sung, Yi-Lin and Cheng, Feng and Bertasius, Gedas and Bansal, Mohit},
journal={arXiv preprint arXiv:2309.10091},
year={2023}
}