
🎥 ➕ 📝 ➡️ 🎥 Composed Video Retrieval via Enriched Context and Discriminative Embeddings [CVPR-2024]

Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah and Fahad Khan

Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE

Overview

> Composed video retrieval (CoVR) is a challenging computer vision problem that combines a visual query with modification text to enable more sophisticated video search in large databases. Existing works predominantly rely on the visual query together with the modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich query-specific context of the retrieved target videos and represents the target video with a visual embedding only. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative vision-only, text-only and vision-text embeddings for better alignment and accurate retrieval of the matched target videos. The proposed framework can be flexibly employed for both composed video (CoVR) and composed image (CoIR) retrieval tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance on both CoVR and zero-shot CoIR, with gains of up to around 7% in recall@K=1.

Dataset

To download the WebVid-CoVR videos, install mpi4py and run:

python tools/scripts/download_covr.py <split>

To download the WebVid-CoVR annotations:

bash tools/scripts/download_annotation.sh covr
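
For example, assuming the usual train/val/test split names (the exact split names accepted by the download script may differ, so check the script itself), a full download could look like:

python tools/scripts/download_covr.py train
python tools/scripts/download_covr.py val
python tools/scripts/download_covr.py test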

Generate Descriptions (optional)

To generate descriptions for the WebVid-CoVR videos, use the scripts tools/scripts/generate_webvid_description_2m.py and tools/scripts/generate_webvid_description_8m.py from inside the main directory of MiniGPT-4.
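
One possible workflow, assuming MiniGPT-4 is already cloned and set up separately (the path below is a placeholder, and the scripts may additionally require MiniGPT-4's own checkpoints and configs), is to copy the scripts into its root directory and run them from there:

cp tools/scripts/generate_webvid_description_2m.py /path/to/MiniGPT-4/
cd /path/to/MiniGPT-4
python generate_webvid_description_2m.py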

Download WebVid-CoVR annotations with our generated descriptions

Download the WebVid-CoVR annotation files with our generated descriptions from here: OneDrive Link

Model Checkpoints

Download the model checkpoints from here: OneDrive Link. Save the checkpoint under the following folder structure: outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/
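
For example, after downloading (the checkpoint file name below is a placeholder for whatever file the OneDrive link provides):

mkdir -p outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/
mv <downloaded-checkpoint>.ckpt outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/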

The final repository structure is as follows:

📦 composed-video-retrieval
 ┣ 📂 annotations
 ┣ 📂 configs
 ┣ 📂 datasets
 ┣ 📂 outputs
 ┣ 📂 src
 ┣ 📂 tools
 ┣ 📜 LICENSE
 ┣ 📜 README.md
 ┣ 📜 test.py
 ┗ 📜 train.py

Installation

Create the environment (the code is tested on Python 3.10, so pin that version):
conda create --name covr python=3.10
conda activate covr

Install the following packages inside the conda environment:

pip install -r requirements.txt

The code was tested on Python 3.10 and PyTorch >= 2.0.
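
A quick sanity check of the environment (nothing repo-specific is assumed here):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"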

Usage :computer:

Computing BLIP embeddings

Before training, you will need to compute the BLIP embeddings for the videos/images. To do so, run:

python tools/embs/save_blip_embs_vids.py # This will compute the embeddings for the WebVid-CoVR videos.
python tools/embs/save_blip_embs_imgs.py # This will compute the embeddings for the CIRR or FashionIQ images.
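
The two scripts above handle this for the repo's datasets and BLIP weights. Purely as an illustration of the idea (not the repo's implementation), here is a minimal sketch that embeds a folder of images with the Hugging Face transformers BLIP model and saves the features; the checkpoint name, folder paths and output format below are assumptions, not what save_blip_embs_imgs.py actually uses:

# Minimal sketch (not the repo's script): embed a folder of images with BLIP
# via Hugging Face transformers and save the normalized features to disk.
from pathlib import Path

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import BlipModel, BlipProcessor  # assumption: HF BLIP as a stand-in

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "Salesforce/blip-image-captioning-base"  # assumption: placeholder BLIP checkpoint
processor = BlipProcessor.from_pretrained(name)
model = BlipModel.from_pretrained(name).to(device).eval()

image_dir = Path("datasets/cirr/images")  # assumption: placeholder input folder
out_file = Path("blip_image_embs.pt")     # assumption: placeholder output file

names, embs = [], []
with torch.no_grad():
    for path in sorted(image_dir.glob("*.png")):
        image = Image.open(path).convert("RGB")
        pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
        feats = model.get_image_features(pixel_values=pixel_values)  # (1, proj_dim)
        embs.append(F.normalize(feats, dim=-1).cpu())
        names.append(path.name)

# Save one tensor of shape (num_images, proj_dim) plus the matching file names.
torch.save({"names": names, "embs": torch.cat(embs)}, out_file)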

Training

The command to launch a training experiment is the following:

python train.py [OPTIONS]

Argument parsing is handled by the Hydra library. You can override anything in the configuration by passing arguments such as foo=value or foo.bar=value.
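
For example (the override keys shown here are illustrative only; the real keys are defined by the config files under configs/):

python train.py model.lr=1e-5 trainer.max_epochs=10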

Evaluation

The command to run the evaluation is the following:

python test.py test=<test> [OPTIONS]
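
For instance, with a hypothetical test configuration name (the available names are defined by the test configs under configs/):

python test.py test=webvid-covr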

Options parameters

Datasets:

Tests:

Checkpoints:

Training

Logging

Machine

Experiment

There are many pre-defined experiments from the paper in configs/experiments. Simply add experiment=<experiment> to the command line to use them.
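
For example:

python train.py experiment=<experiment>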

SLURM setting

Use slurm_train.sh and slurm_test.sh when running in a SLURM setting.
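
These are typically submitted with sbatch; the partition, account and resource flags depend on your cluster and on what the scripts already specify:

sbatch slurm_train.sh
sbatch slurm_test.sh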

Acknowledgements

Citation

  @article{thawakar2024composed,
          title={Composed Video Retrieval via Enriched Context and Discriminative Embeddings},
          author={Omkar Thawakar and Muzammal Naseer and Rao Muhammad Anwer and Salman Khan and Michael Felsberg and Mubarak Shah and Fahad Shahbaz Khan},
          journal={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
          year={2024}
  }