
🎥 ➕ 📝 ➡️ 🎥 Composed Video Retrieval via Enriched Context and Discriminative Embeddings [CVPR-2024]

Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah and Fahad Khan

Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE

Overview

> Composed video retrieval (CoVR) is a challenging computer vision problem that combines a visual query with modification text to enable more sophisticated video search in large databases. Existing works predominantly rely on the visual query together with the modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context of the retrieved target videos and represents the target video with a visual embedding only. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative vision-only, text-only and vision-text embeddings for better alignment and accurate retrieval of the matched target videos. The proposed framework can be flexibly employed for both composed video (CoVR) and composed image (CoIR) retrieval tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance on both CoVR and zero-shot CoIR, with gains of up to around 7% in recall@K=1.

Dataset

To download the WebVid-CoVR videos, install mpi4py and run:

python tools/scripts/download_covr.py <split>

To download the WebVid-CoVR annotations:

bash tools/scripts/download_annotation.sh covr
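
For example, assuming the usual train/val/test split names (the exact split names accepted by the download script may differ, so check the script itself), a full download could look like:

python tools/scripts/download_covr.py train
python tools/scripts/download_covr.py val
python tools/scripts/download_covr.py test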

Generate Descriptions (optional)

To generate descriptions for the WebVid-CoVR videos, use the scripts tools/scripts/generate_webvid_description_2m.py and tools/scripts/generate_webvid_description_8m.py from inside the main directory of MiniGPT-4.
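
One possible workflow, assuming MiniGPT-4 is already cloned and set up separately (the path below is a placeholder, and the scripts may additionally require MiniGPT-4's own checkpoints and configs), is to copy the scripts into its root directory and run them from there:

cp tools/scripts/generate_webvid_description_2m.py /path/to/MiniGPT-4/
cd /path/to/MiniGPT-4
python generate_webvid_description_2m.py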

Download WebVid-CoVR annotations with our generated descriptions

Download the WebVid-CoVR annotation files with our generated descriptions from here: OneDrive Link

Model Checkpoints

Download the model checkpoints from here: OneDrive Link. Save the checkpoint under the following folder structure: outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/
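
For example, after downloading (the checkpoint file name below is a placeholder for whatever file the OneDrive link provides):

mkdir -p outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/
mv <downloaded-checkpoint>.ckpt outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/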

The final repository structure is as follows:

📦 composed-video-retrieval
 ┣ 📂 annotations
 ┣ 📂 configs
 ┣ 📂 datasets
 ┣ 📂 outputs
 ┣ 📂 src
 ┣ 📂 tools
 ┣ 📜 LICENSE
 ┣ 📜 README.md
 ┣ 📜 test.py
 ┗ 📜 train.py

Installation

Create the environment (the code is tested on Python 3.10, so pin that version):
conda create --name covr python=3.10
conda activate covr

Install the following packages inside the conda environment:

pip install -r requirements.txt

The code was tested on Python 3.10 and PyTorch >= 2.0.
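
A quick sanity check of the environment (nothing repo-specific is assumed here):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"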

Usage :computer:

Computing BLIP embeddings

Before training, you will need to compute the BLIP embeddings for the videos/images. To do so, run:

python tools/embs/save_blip_embs_vids.py # This will compute the embeddings for the WebVid-CoVR videos.
python tools/embs/save_blip_embs_imgs.py # This will compute the embeddings for the CIRR or FashionIQ images.
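
The two scripts above handle this for the repo's datasets and BLIP weights. Purely as an illustration of the idea (not the repo's implementation), here is a minimal sketch that embeds a folder of images with the Hugging Face transformers BLIP model and saves the features; the checkpoint name, folder paths and output format below are assumptions, not what save_blip_embs_imgs.py actually uses:

# Minimal sketch (not the repo's script): embed a folder of images with BLIP
# via Hugging Face transformers and save the normalized features to disk.
from pathlib import Path

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import BlipModel, BlipProcessor  # assumption: HF BLIP as a stand-in

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "Salesforce/blip-image-captioning-base"  # assumption: placeholder BLIP checkpoint
processor = BlipProcessor.from_pretrained(name)
model = BlipModel.from_pretrained(name).to(device).eval()

image_dir = Path("datasets/cirr/images")  # assumption: placeholder input folder
out_file = Path("blip_image_embs.pt")     # assumption: placeholder output file

names, embs = [], []
with torch.no_grad():
    for path in sorted(image_dir.glob("*.png")):
        image = Image.open(path).convert("RGB")
        pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
        feats = model.get_image_features(pixel_values=pixel_values)  # (1, proj_dim)
        embs.append(F.normalize(feats, dim=-1).cpu())
        names.append(path.name)

# Save one tensor of shape (num_images, proj_dim) plus the matching file names.
torch.save({"names": names, "embs": torch.cat(embs)}, out_file)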

Training

The command to launch a training experiment is the following:

python train.py [OPTIONS]

Argument parsing is handled by the Hydra library. You can override anything in the configuration by passing arguments such as foo=value or foo.bar=value.
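
For example (the override keys shown here are illustrative only; the real keys are defined by the config files under configs/):

python train.py model.lr=1e-5 trainer.max_epochs=10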

Evaluation

The command to run the evaluation is the following:

python test.py test=<test> [OPTIONS]
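
For instance, with a hypothetical test configuration name (the available names are defined by the test configs under configs/):

python test.py test=webvid-covr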

Options parameters

Datasets:

Tests:

Checkpoints:

Training

Logging

Machine

Experiment

There are many pre-defined experiments from the paper in configs/experiments. Simply add experiment=<experiment> to the command line to use them.
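
For example:

python train.py experiment=<experiment>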

SLURM setting

Use slurm_train.sh and slurm_test.sh when running in a SLURM setting.
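
These are typically submitted with sbatch; the partition, account and resource flags depend on your cluster and on what the scripts already specify:

sbatch slurm_train.sh
sbatch slurm_test.sh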

Acknowledgements

Citation

  @article{thawakar2024composed,
          title={Composed Video Retrieval via Enriched Context and Discriminative Embeddings},
          author={Omkar Thawakar and Muzammal Naseer and Rao Muhammad Anwer and Salman Khan and Michael Felsberg and Mubarak Shah and Fahad Shahbaz Khan},
          journal={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          year={2024}
  }