To download the WebVid-CoVR videos, install `mpi4py` and run:

```bash
python tools/scripts/download_covr.py <split>
```
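If `mpi4py` is not already available in your environment, it can usually be installed with pip; note that this assumes a working MPI implementation (e.g. OpenMPI) is present on the system:

```bash
# Dependency of the download script; requires an MPI library on the host
pip install mpi4py
```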
To download the annotations of WebVid-CoVR:

```bash
bash tools/scripts/download_annotation.sh covr
```
To generate the descriptions of the WebVid-CoVR videos, run the scripts `tools/scripts/generate_webvid_description_2m.py` and `tools/scripts/generate_webvid_description_8m.py` from the main directory of MiniGPT-4.
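For reference, a sketch of the two invocations, assuming both scripts are reachable from the MiniGPT-4 root and take no extra arguments (check each script's argument parser before running):

```bash
# Run from the MiniGPT-4 root directory
python tools/scripts/generate_webvid_description_2m.py   # descriptions for the 2M videos
python tools/scripts/generate_webvid_description_8m.py   # descriptions for the 8M videos
```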
Download the WebVid-CoVR annotation files with our generated descriptions from here: OneDrive Link

Download the model checkpoints from here: OneDrive Link.

Save the checkpoint in the following folder structure: `outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/`
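A small shell sketch for placing the downloaded checkpoint into that directory; the filename `ckpt_best.ckpt` and the download location are placeholders and may differ from what the OneDrive link provides:

```bash
mkdir -p outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/
# Hypothetical filename and source path: adjust to match the downloaded checkpoint
mv ~/Downloads/ckpt_best.ckpt outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/
```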
The final repository structure should look like:

```
📦 composed-video-retrieval
 ┣ 📂 annotations
 ┣ 📂 configs
 ┣ 📂 datasets
 ┣ 📂 outputs
 ┣ 📂 src
 ┣ 📂 tools
 ┣ 📜 LICENSE
 ┣ 📜 README.md
 ┣ 📜 test.py
 ┗ 📜 train.py
```
Create and activate a conda environment, then install the required packages inside it:

```bash
conda create --name covr
conda activate covr
pip install -r requirements.txt
```

The code was tested on Python 3.10 and PyTorch >= 2.0.
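As a quick sanity check (not part of the original instructions), you can verify that the interpreter and PyTorch versions match the tested configuration:

```bash
python -c "import sys, torch; print(sys.version.split()[0], torch.__version__)"
```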
Before training, you will need to compute the BLIP embeddings for the videos/images. To do so, run:

```bash
python tools/embs/save_blip_embs_vids.py  # embeddings for the WebVid-CoVR videos
python tools/embs/save_blip_embs_imgs.py  # embeddings for the CIRR or FashionIQ images
```
The command to launch a training experiment is the following:

```bash
python train.py [OPTIONS]
```

Parsing is done with the powerful Hydra library. You can override anything in the configuration by passing arguments like `foo=value` or `foo.bar=value`.
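For example, a training command combining overrides that appear elsewhere in this README; the dotted path `trainer.devices` is an assumption about where the `devices` field lives in the config, so check the files under `configs/` for the exact key:

```bash
python train.py data=webvid-covr model/ckpt=blip-l-coco trainer=gpu trainer.devices=2
```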
The command to evaluate is the following:

```bash
python test.py test=<test> [OPTIONS]
```
Options:

- `data=webvid-covr`: WebVid-CoVR dataset.
- `data=cirr`: CIRR dataset.
- `data=fashioniq-split`: FashionIQ dataset; change `split` to `dress`, `shirt` or `toptee`.
- `test=all`: Test on WebVid-CoVR, CIRR and all three Fashion-IQ test sets.
- `test=webvid-covr`: Test on WebVid-CoVR.
- `test=cirr`: Test on CIRR.
- `test=fashioniq`: Test on all three Fashion-IQ test sets (`dress`, `shirt` and `toptee`).
- `model/ckpt=blip-l-coco`: Default checkpoint for BLIP-L finetuned on COCO.
- `model/ckpt=webvid-covr`: Default checkpoint for CoVR finetuned on WebVid-CoVR.
- `trainer=gpu`: Training with CUDA; change `devices` to the number of GPUs you want to use.
- `trainer=ddp`: Training with Distributed Data Parallel (DDP); change `devices` and `num_nodes` to the number of GPUs and the number of nodes you want to use.
- `trainer=cpu`: Training on the CPU (not recommended).
- `trainer/logger=csv`: Log the results in a CSV file. Very basic functionality.
- `trainer/logger=wandb`: Log the results in wandb. This requires installing `wandb` and setting up your wandb account. This is what we used to log our experiments.
- `trainer/logger=<other>`: Other loggers (not tested).
- `machine=server`: You can change the default path to the dataset folder and the batch size. You can create your own machine configuration by adding a new file in `configs/machine`.

There are many pre-defined experiments from the paper in `configs/experiments`. Simply add `experiment=<experiment>` to the command line to use them (an example combining several of these options follows this list).
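Putting a few of the options above together; both commands only use names listed in this README, and `<experiment>` stays a placeholder for one of the files in `configs/experiments`:

```bash
# Evaluate the WebVid-CoVR-finetuned checkpoint on all benchmarks
python test.py test=all model/ckpt=webvid-covr

# Reproduce a pre-defined experiment from the paper on GPU
python train.py experiment=<experiment> trainer=gpu
```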
Use `slurm_train.sh` and `slurm_test.sh` when running on a SLURM cluster.
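A hedged usage sketch, assuming the two scripts are standard `sbatch` submission scripts; the partition names, paths and resource requests inside them will likely need to be adapted to your cluster:

```bash
sbatch slurm_train.sh   # submit a training job
sbatch slurm_test.sh    # submit an evaluation job
```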
```bibtex
@article{thawakar2024composed,
  title={Composed Video Retrieval via Enriched Context and Discriminative Embeddings},
  author={Omkar Thawakar and Muzammal Naseer and Rao Muhammad Anwer and Salman Khan and Michael Felsberg and Mubarak Shah and Fahad Shahbaz Khan},
  journal={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```