UCoFiA: Unified Coarse-to-Fine Alignment for Video-Text Retrieval (ICCV 2023)

Authors: Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal
Paper link: arXiv:2309.10091
Introduction: UCoFiA captures cross-modal similarity information at different granularity levels (video-sentence, frame-sentence, pixel-word) and unifies the multi-level alignments for video-text retrieval. Our model achieves state-of-the-art results on five benchmark datasets: MSR-VTT, MSVD, ActivityNet, DiDeMo, and VATEX.
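To make the idea of multi-level similarity concrete, here is a toy sketch for a single video-text pair. It is not the code in this repository: the function names, the mean pooling, the max-then-mean aggregation at the finest level, and the plain summation used to combine levels are all assumptions chosen for brevity, and the finest level is sketched on generic per-region features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def coarse_to_fine_similarity(frame_feats, region_feats, sentence_feat, word_feats):
    """Toy multi-level similarity for one video-text pair (hypothetical helper).

    frame_feats:   (F, D)    one feature per frame
    region_feats:  (F, R, D) R region features per frame
    sentence_feat: (D,)      one feature for the whole sentence
    word_feats:    (W, D)    one feature per word
    """
    frames = l2_normalize(frame_feats)
    regions = l2_normalize(region_feats)
    sentence = l2_normalize(sentence_feat)
    words = l2_normalize(word_feats)

    # Coarse level: mean-pooled video feature vs. sentence (pooling is an assumption).
    video = l2_normalize(frames.mean(axis=0))
    video_sentence = float(video @ sentence)

    # Middle level: per-frame similarity, averaged (a placeholder aggregation).
    frame_sentence = float((frames @ sentence).mean())

    # Fine level: each word takes its best-matching region, then average over words.
    region_word = regions.reshape(-1, regions.shape[-1]) @ words.T  # (F*R, W)
    fine = float(region_word.max(axis=0).mean())

    # Combine the levels by plain summation (an assumption for illustration only).
    return {
        "video_sentence": video_sentence,
        "frame_sentence": frame_sentence,
        "fine": fine,
        "unified": video_sentence + frame_sentence + fine,
    }
```

In a retrieval setting, a score like `unified` would be computed for every video-text pair to fill a similarity matrix, which is then ranked per query.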

Code structure


```
# training code for UCoFiA
./train

# video-to-text retrieval evaluation with SK-norm
./eval_v2t

# text-to-video retrieval evaluation with SK-norm
./eval_t2v
```

Update

The UCoFiA checkpoint for the MSR-VTT dataset is available at https://drive.google.com/file/d/1-zynuNEI1u0TwNxw2R163kX4kocWDsIu/view?usp=sharing

Setup

Install Dependencies

(Optional) Create a conda environment

```
conda create -n ucofia python=3.8
conda activate ucofia
```

Install PyTorch: torch==1.12.1, torchvision==0.13.1

Install the other dependencies:
```
pip install -r requirements.txt
```

Download the CLIP (ViT-B/32) weights

```
wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
```
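A large download like this can be truncated silently, so it may be worth checking the file before training. The sketch below assumes that the 64-character hex directory name in the URL is the SHA-256 digest of the weight file; the helper names are hypothetical and not part of this repository.

```python
import hashlib

CLIP_URL = (
    "https://openaipublic.azureedge.net/clip/models/"
    "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
)


def expected_sha256_from_url(url):
    """Return the second-to-last path component, assumed to be the SHA-256 digest."""
    return url.split("/")[-2]


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, url=CLIP_URL):
    """True if the file's digest matches the digest embedded in the URL."""
    return sha256_of_file(path) == expected_sha256_from_url(url)
```

For example, `verify_download("./modules/ViT-B-32.pt")` should return `True` after a complete download if the assumption about the URL holds.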

Dataset Preparation

Download data

We test our model on MSR-VTT, MSVD, ActivityNet, DiDeMo, and VATEX.

Please refer to CLIP4Clip for downloading the first four datasets and to TS2-Net for downloading the VATEX dataset.

Compress raw video

Following CLIP4Clip, we compress all videos to 3 fps at 224×224 resolution.
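As a rough illustration of this step, the sketch below builds an `ffmpeg` command that resamples a video to 3 fps and rescales it. It assumes `ffmpeg` is installed and interprets 224 as the length of the shorter side with the aspect ratio preserved; that interpretation, the function names, and the file names are assumptions, so check the CLIP4Clip preprocessing script for the exact settings.

```python
import subprocess


def build_compress_command(src, dst, fps=3, size=224):
    """Build an ffmpeg argument list (hypothetical helper, not from this repo)."""
    # Scale the shorter side to `size`, keeping the aspect ratio and even dimensions.
    # This shorter-side reading of the target resolution is an assumption.
    scale = (
        f"scale='if(gt(a,1),trunc(oh*a/2)*2,{size})':"
        f"'if(gt(a,1),{size},trunc(ow/a/2)*2)'"
    )
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it exists
        "-i", src,
        "-filter:v", scale,
        "-map", "0:v",     # keep only the video stream
        "-r", str(fps),    # output frame rate
        dst,
    ]


def compress_video(src, dst, fps=3, size=224):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_compress_command(src, dst, fps, size), check=True)
```

Calling `compress_video("raw/video0.mp4", "compressed/video0.mp4")` would process one file; looping over a dataset directory is left to the caller.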

Training and Inference

We provide UCoFiA training and evaluation script examples as follows. Please customize your data path in the scripts.

1) Train UCoFiA on the MSR-VTT dataset

```
cd ./train
sh scripts/train_msrvtt.sh
```

2) Evaluate text-to-video retrieval with SK-norm on the MSR-VTT dataset

To use SK-norm at inference time, first change the checkpoint path in line xxx of ./eval_t2v/main_ucofia.py to the best checkpoint saved during training (you can evaluate multiple checkpoints for better results).

```
cd ./eval_t2v
sh scripts/eval_msrvtt.sh
```

3) Evaluate video-to-text retrieval with SK-norm on the MSR-VTT dataset

```
cd ./eval_v2t
sh scripts/eval_msrvtt.sh
```
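For readers unfamiliar with the SK-norm mentioned in the evaluation steps, the sketch below shows the general idea, assuming SK-norm denotes Sinkhorn-Knopp normalization of the query-by-candidate similarity matrix. The function name, iteration count, and temperature are illustrative assumptions and are not taken from the evaluation scripts.

```python
import numpy as np


def sinkhorn_knopp(sim, n_iters=4, temperature=1.0):
    """Alternately normalize rows and columns of exp(sim / temperature).

    sim: (N_queries, N_candidates) raw similarity matrix.
    Returns a positive matrix whose columns sum to 1 and whose rows
    approach a common sum as n_iters grows.
    """
    # Subtract the global max before exponentiating for numerical stability.
    scores = np.exp((sim - sim.max()) / temperature)
    for _ in range(n_iters):
        scores = scores / scores.sum(axis=1, keepdims=True)  # each row sums to 1
        scores = scores / scores.sum(axis=0, keepdims=True)  # each column sums to 1
    return scores
```

Retrieval then ranks candidates on the normalized scores, for example with `np.argsort(-sinkhorn_knopp(sim), axis=1)`. The intuition is that balancing rows and columns discourages a few candidates from dominating the top ranks for many queries.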

Acknowledgments

We thank the developers of X-CLIP, TS2-Net, CLIP4Clip, and CLIP for releasing their code publicly. We also thank the authors of NCL for helpful discussions.

Reference

Please cite our paper if you use our models in your work:


```
@article{wang2023unified,
  title={Unified Coarse-to-Fine Alignment for Video-Text Retrieval},
  author={Wang, Ziyang and Sung, Yi-Lin and Cheng, Feng and Bertasius, Gedas and Bansal, Mohit},
  journal={arXiv preprint arXiv:2309.10091},
  year={2023}
}
```
