albanie / collaborative-experts

Video embeddings for retrieval with natural language queries
https://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/
Apache License 2.0
deep-neural-networks video-retrieval

This repo provides code:

Requirements: The code assumes PyTorch 1.4 and Python 3.7 (other versions may work, but have not been tested). See the section on dependencies towards the end of this file for specific package requirements.

TeachText

TeachText diagram

TeachText results on MSRVTT Benchmark

Model Split Task R@1 R@5 R@10 R@50 MdR MnR Geom Links
CE Full t2v 11.0(0.0) 30.8(0.1) 43.3(0.3) 73.1(0.2) 15.0(0.0) 81.8(0.2) 24.4(0.1) config_TT, model_TT, log_TT
CE+ Full t2v 13.8(0.1) 36.5(0.2) 49.4(0.4) 77.6(0.2) 11.0(0.0) 69.4(0.8) 29.2(0.2) config_TT, model_TT, log_TT
TeachText - CE Full t2v 11.8(0.1) 32.7(0.2) 45.3(0.2) 74.9(0.1) 13.0(0.0) 74.9(0.4) 25.9(0.1) config_TT, model_TT, log_TT
TeachText - CE+ Full t2v 14.6(0.0) 37.9(0.1) 50.9(0.2) 78.9(0.0) 10.0(0.0) 63.1(0.2) 30.4(0.0) config_TT, model_TT, log_TT

Please note that the numbers are higher than in the original CE results due to the correction of compression artefacts.

Denoising results on MSRVTT

Model Split Task R@1 R@5 R@10 R@50 MdR MnR Geom Links
CE+ Full t2v 14.4(0.1) 37.4(0.2) 50.2(0.1) 77.9(0.1) 10.0(0.0) 70.8(0.1) 30.0(0.1) config_TT, model_TT, log_TT
TeachText - CE+ Full t2v 14.9(0.1) 38.3(0.1) 51.5(0.1) 79.2(0.1) 10.0(0.0) 62.5(0.5) 30.9(0.1) config_TT, model_TT, log_TT

TeachText results on MSVD Benchmark

Model Split Task R@1 R@5 R@10 R@50 MdR MnR Geom Links
CE Full t2v 21.5(0.6) 52.3(0.9) 67.5(0.8) 90.7(0.0) 5.0(0.0) 20.4(0.0) 42.3(0.6) config_TT, model_TT, log_TT
CE+ Full t2v 25.1(0.9) 56.5(1.4) 70.9(1.6) 92.4(0.5) 4.0(0.0) 17.8(0.6) 46.5(1.0) config_TT, model_TT, log_TT
TeachText - CE Full t2v 22.1(0.5) 52.2(0.6) 67.2(0.8) 91.2(0.5) 5.0(0.0) 19.6(0.5) 42.6(0.4) config_TT, model_TT, log_TT
TeachText - CE+ Full t2v 25.1(0.6) 56.8(0.6) 71.2(0.6) 92.7(0.3) 4.0(0.0) 16.8(0.3) 46.6(0.5) config_TT, model_TT, log_TT

Denoising results on MSVD

Model Split Task R@1 R@5 R@10 R@50 MdR MnR Geom Links
CE+ Full t2v 26.2(0.5) 57.7(1.0) 72.2(1.2) 92.2(0.4) 4.0(0.0) 17.9(0.5) 47.8(0.6) config_TT, model_TT, log_TT
TeachText - CE+ Full t2v 25.4(0.4) 56.9(0.5) 71.3(0.3) 92.8(0.2) 4.0(0.0) 16.7(0.2) 46.9(0.3) config_TT, model_TT, log_TT

TeachText results on DiDeMo Benchmark

Model Split Task R@1 R@5 R@10 R@50 MdR MnR Geom Links
CE Full t2v 17.1(0.9) 41.9(0.2) 56.0(0.5) 83.4(0.9) 8.0(0.0) 42.8(2.8) 34.2(0.4) config_TT, model_TT, log_TT
CE+ Full t2v 18.2(0.3) 43.9(1.1) 57.1(0.9) 84.0(1.6) 7.9(0.1) 38.5(3.4) 35.8(0.4) config_TT, model_TT, log_TT
TeachText - CE Full t2v 21.0(0.7) 47.5(1.1) 61.9(0.6) 86.4(1.0) 6.0(0.0) 35.1(1.0) 39.5(0.5) config_TT, model_TT, log_TT
TeachText - CE+ Full t2v 21.6(0.8) 48.6(0.5) 62.9(0.7) 86.8(0.3) 6.0(0.0) 31.5(0.8) 40.4(0.4) config_TT, model_TT, log_TT

TeachText results on LSMDC Benchmark

Model Split Task R@1 R@5 R@10 R@50 MdR MnR Geom Links
CE Full t2v 12.4(0.7) 28.5(0.8) 37.9(0.6) 64.5(0.8) 21.7(0.6) 88.0(4.8) 23.7(0.3) config_TT, model_TT, log_TT
CE+ Full t2v 14.9(0.7) 33.7(0.2) 44.1(0.7) 67.3(0.8) 15.3(0.6) 77.8(6.7) 28.1(0.3) config_TT, model_TT, log_TT
TeachText - CE Full t2v 13.7(0.9) 30.2(0.4) 40.1(0.4) 66.0(0.6) 19.8(1.3) 84.0(1.8) 25.5(0.5) config_TT, model_TT, log_TT
TeachText - CE+ Full t2v 17.2(0.5) 36.5(0.7) 46.3(0.4) 68.8(0.4) 13.7(0.6) 72.3(0.1) 30.7(0.3) config_TT, model_TT, log_TT

TeachText results on Activity-Net Benchmark

Model Split Task R@1 R@5 R@10 R@50 MdR MnR Geom Links
CE Full t2v 19.9(0.4) 50.1(0.8) 66.1(0.6) 92.2(0.7) 5.3(0.6) 21.3(1.1) 40.4(0.3) config_TT, model_TT, log_TT
CE+ Full t2v 19.4(0.2) 49.3(0.5) 65.4(0.4) 92.1(0.2) 6.0(0.0) 22.5(0.4) 39.7(0.0) config_TT, model_TT, log_TT
TeachText - CE Full t2v 22.7(0.8) 56.2(0.1) 71.6(0.8) 95.3(0.1) 4.0(0.0) 15.8(0.1) 45.0(0.6) config_TT, model_TT, log_TT
TeachText - CE+ Full t2v 23.5(0.2) 57.2(0.6) 73.6(0.2) 96.1(0.1) 4.0(0.0) 13.7(0.1) 46.3(0.2) config_TT, model_TT, log_TT

You can download the high quality features used for TeachText from:

For MSRVTT:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-MSRVTT-experts.tar.gz
sha1sum: 734650c3b98509996da75cdedc12101836624917

For MSVD:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-MSVD-experts.tar.gz
sha1sum: c8eba8c5291dd6bb501757ed0cc327cd22217965

For DiDeMo:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-DiDeMo-experts.tar.gz
sha1sum: 8e128309f12cf3260fe538f82578b5ad91a46bd0

For ActivityNet:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-activity-net-experts.tar.gz
sha1sum: 2f3c7c2fe86bd6d0c6230464a940c429291a4012

Collaborative Experts

CE diagram

High-level Overview: The Collaborative Experts framework aims to achieve robustness through two mechanisms:

  1. The use of information from a wide range of modalities, including those that are typically always available in video (such as RGB) as well as more "specific" clues which may only occasionally be present (such as overlaid text).
  2. A module that aims to combine these modalities into a fixed-size representation in a manner that is robust to noise.


Important: A note on the updated results: A previous version of the codebase (and paper) reported results on the retrieval benchmarks that were affected by a significant software bug, leading to an overestimate of performance. We are extremely grateful to Valentin Gabeur, who discovered this bug (it has been corrected in the current codebase).

CVPR 2020: Pentathlon challenge


We are hosting a video retrieval challenge as part of the Video Pentathlon Workshop. Find out how to participate here!

Pretrained video embeddings

We provide pretrained models for each dataset to reproduce the results reported in the paper [1] (references follow at the end of this README). Each model is accompanied by training and evaluation logs. Performance is evaluated for retrieval in both directions, text-to-video (t2v) and video-to-text (v2t); the joint embeddings can be used for either of these two tasks.

In the results reported below, the same model is used for both the t2v and v2t evaluations. Each metric is reported as the mean and standard deviation (in parentheses) across three training runs.
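
The column names in the tables can be illustrated with a small sketch that computes them from a query-by-video similarity matrix. This is not the repository's evaluation code. It assumes one query per video with query i matching video i, reads MdR and MnR as the median and mean rank, and reads Geom as the geometric mean of R@1, R@5 and R@10 (an assumption that agrees with the tabulated values but is not stated in this README). The names `retrieval_metrics` and `summarise` are hypothetical.

```python
import numpy as np


def retrieval_metrics(sims):
    """Compute retrieval metrics from a (num_queries, num_videos) matrix.

    Assumes the correct match for query i is video i.
    """
    sims = np.asarray(sims, dtype=float)
    correct = np.diag(sims)[:, None]
    # Rank of the correct item = 1 + number of items scored strictly higher
    # (ties are therefore resolved optimistically).
    ranks = 1 + (sims > correct).sum(axis=1)
    metrics = {f"R@{k}": float(100.0 * np.mean(ranks <= k)) for k in (1, 5, 10, 50)}
    metrics["MdR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    # Assumed definition: geometric mean of R@1, R@5 and R@10.
    metrics["Geom"] = float(
        (metrics["R@1"] * metrics["R@5"] * metrics["R@10"]) ** (1.0 / 3.0)
    )
    return metrics


def summarise(runs):
    """Format each metric as mean(std) across runs, as in the tables.

    Uses the population standard deviation (numpy's default), which is an
    assumption about how the reported deviations were computed.
    """
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        summary[name] = f"{np.mean(values):.1f}({np.std(values):.1f})"
    return summary
```

Passing the transposed similarity matrix gives the metrics for the opposite retrieval direction under the same one-to-one assumption.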

MSRVTT Benchmark

Model Split Task R@1 R@5 R@10 R@50 MdR MnR Links
CE Full t2v 10.0(0.1) 29.0(0.3) 41.2(0.2) 71.4(0.1) 16.0(0.0) 86.8(0.3) config, model, log
CE 1k-A t2v 20.9(1.2) 48.8(0.6) 62.4(0.8) 89.1(0.4) 6.0(0.0) 28.2(0.8) config, model, log
CE 1k-B t2v 18.2(0.7) 46.0(0.4) 60.7(0.2) 86.6(0.5) 7.0(0.0) 35.3(1.1) config, model, log
MoEE* 1k-B t2v 15.0(0.7) 39.7(1.0) 54.5(1.1) 82.7(0.6) 8.3(0.6) 43.7(0.7) config, model, log
CE Full v2t 15.6(0.3) 40.9(1.4) 55.2(1.0) 84.0(0.1) 8.3(0.6) 38.1(1.8) config, model, log
CE 1k-A v2t 20.6(0.6) 50.3(0.5) 64.0(0.2) 89.9(0.3) 5.3(0.6) 25.1(0.8) config, model, log
CE 1k-B v2t 18.0(0.8) 46.0(0.5) 60.3(0.5) 86.4(0.3) 6.5(0.5) 30.6(1.2) config, model, log
MoEE* 1k-B v2t 14.5(0.8) 40.4(0.8) 54.9(1.0) 83.8(0.5) 8.8(0.4) 38.7(0.9) config, model, log

Models marked with * use the features made available with the MoEE model of [2] (without OCR, speech and scene features). Unstarred models on the 1k-B and Full splits make use of OCR, speech and scene features, as well as slightly stronger text encodings (GPT, rather than word2vec - see [1] for details). The MoEE model is implemented as a sanity check that our codebase approximately reproduces [2] (the MoEE paper).

See the MSRVTT README for links to the train/val/test lists of each split.

MSVD Benchmark

Model Task R@1 R@5 R@10 R@50 MdR MnR Links
CE t2v 19.8(0.3) 49.0(0.3) 63.8(0.1) 89.0(0.2) 6.0(0.0) 23.1(0.3) config, model, log
CE v2t 23.9(1.4) 50.2(0.8) 59.6(1.2) 82.3(0.7) 5.6(0.5) 41.2(3.4) config, model, log

See the MSVD README for descriptions of the train/test splits. Note that the videos in the MSVD dataset do not have soundtracks.

DiDeMo Benchmark

Model Task R@1 R@5 R@10 R@50 MdR MnR Links
CE t2v 16.1(1.4) 41.1(0.4) 54.4(0.8) 82.7(0.3) 8.3(0.6) 43.7(3.6) config, model, log
CE v2t 15.6(1.3) 40.9(0.4) 55.2(0.5) 82.2(1.3) 8.2(0.3) 42.4(3.3) config, model, log

See the DiDeMo README for descriptions of the train/val/test splits.

ActivityNet Benchmark

Model Task R@1 R@5 R@10 R@50 MdR MnR Links
CE t2v 18.2(0.3) 47.7(0.6) 63.9(0.5) 91.4(0.4) 6.0(0.0) 23.1(0.5) config, model, log
CE v2t 17.7(0.6) 46.6(0.7) 62.8(0.4) 90.9(0.2) 6.0(0.0) 24.4(0.5) config, model, log

See the ActivityNet README for descriptions of the train/test splits.

LSMDC Benchmark

Model Task R@1 R@5 R@10 R@50 MdR MnR Links
CE t2v 11.2(0.4) 26.9(1.1) 34.8(2.0) 62.1(1.5) 25.3(3.1) 96.8(5.0) config, model, log
CE v2t 11.7(0.5) 25.8(1.5) 34.4(1.7) 61.4(0.7) 28.0(2.6) 97.6(2.8) config, model, log

See the LSMDC README for descriptions of the train/test splits. Please note that to obtain the features and descriptions for this dataset, you must obtain permission from MPII to use the data (this process is described here). Once you have done so, please request that a member of the LSMDC team contacts us to confirm approval (via albanie at robots dot ox dot ac dot uk); we can then provide you with a link to the features.

Ablation studies on MSRVTT

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the MSRVTT dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

Model Task R@1 R@5 R@10 MdR Params Links
Concat t2v 0.0(0.0) 0.0(0.0) 0.0(0.0) 1495.5(0.0) 369.72k config, model, log
CE - MW,P,CG t2v 8.5(0.1) 25.9(0.3) 37.6(0.2) 19.0(0.0) 246.22M config, model, log
CE - P,CG t2v 9.6(0.1) 28.0(0.2) 39.7(0.2) 17.7(0.6) 400.41M config, model, log
CE - CG t2v 9.7(0.1) 28.1(0.2) 40.2(0.1) 17.0(0.0) 181.07M config, model, log
CE t2v 10.0(0.1) 29.0(0.3) 41.2(0.2) 16.0(0.0) 183.45M config, model, log
Concat v2t 0.0(0.0) 0.0(0.0) 0.0(0.0) 29897.5(0.0) 369.72k config, model, log
CE - MW,P,CG v2t 13.7(0.4) 38.8(1.2) 53.1(1.1) 9.2(0.8) 246.22M config, model, log
CE - P,CG v2t 14.1(0.2) 39.5(1.0) 53.2(0.3) 9.0(0.0) 400.41M config, model, log
CE - CG v2t 15.1(0.3) 40.3(0.5) 54.3(0.7) 8.8(0.3) 181.07M config, model, log
CE v2t 15.6(0.3) 40.9(1.4) 55.2(1.0) 8.3(0.6) 183.45M config, model, log

Each row adds an additional component to the model. The names refer to the following model designs:

Note that in the table above some metrics have been removed to allow the number of parameters to be displayed---these additional metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates the value of each of the different experts towards the final embedding. Since not all experts are available in every video, we pair each expert with scene features, to give an approximation of their usefulness.

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 4.0(0.1) 14.1(0.1) 22.4(0.3) 50.0(1.0) 19.46M config, model, log
Scene + Inst. t2v 7.2(0.1) 22.3(0.3) 33.0(0.2) 25.3(0.6) 41.12M config, model, log
Scene + r2p1d t2v 6.8(0.1) 21.7(0.1) 32.4(0.1) 25.7(0.6) 39.95M config, model, log
Scene + RGB t2v 5.0(0.2) 16.6(0.7) 25.5(1.0) 40.7(2.1) 41.12M config, model, log
Scene + Flow t2v 5.3(0.3) 17.6(0.8) 27.1(0.9) 36.0(1.7) 40.34M config, model, log
Scene + Audio t2v 5.6(0.0) 18.7(0.1) 28.2(0.1) 33.7(0.6) 40.34M config, model, log
Scene + OCR t2v 4.1(0.1) 14.1(0.1) 22.2(0.2) 50.3(1.2) 49.49M config, model, log
Scene + Speech t2v 4.6(0.1) 15.5(0.2) 24.4(0.2) 44.7(1.2) 43.94M config, model, log
Scene + Face t2v 4.1(0.1) 14.2(0.3) 22.4(0.4) 49.7(0.6) 39.95M config, model, log
Scene v2t 5.6(0.6) 18.2(0.6) 27.7(0.3) 39.0(0.0) 19.46M config, model, log
Scene + Inst. v2t 10.1(0.3) 29.7(0.5) 41.9(0.7) 15.2(0.9) 41.12M config, model, log
Scene + r2p1d v2t 9.4(0.3) 27.8(0.6) 40.1(1.1) 17.2(1.1) 39.95M config, model, log
Scene + RGB v2t 6.9(0.5) 21.2(0.9) 31.1(1.9) 28.7(3.8) 41.12M config, model, log
Scene + Flow v2t 7.3(0.6) 22.3(1.4) 33.4(1.7) 25.2(2.0) 40.34M config, model, log
Scene + Audio v2t 8.2(0.4) 24.8(0.4) 36.0(0.1) 21.7(0.6) 40.34M config, model, log
Scene + OCR v2t 5.4(0.5) 18.6(1.2) 26.6(1.2) 40.0(1.0) 49.49M config, model, log
Scene + Speech v2t 6.0(0.2) 20.4(0.5) 30.3(1.0) 33.0(2.0) 43.94M config, model, log
Scene + Face v2t 5.6(1.0) 17.9(0.7) 26.7(0.8) 39.1(2.6) 39.95M config, model, log

We can also study their cumulative effect:

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 4.0(0.1) 14.1(0.1) 22.4(0.3) 50.0(1.0) 19.46M config, model, log
Prev. + Speech t2v 4.6(0.1) 15.5(0.2) 24.4(0.2) 44.7(1.2) 43.94M config, model, log
Prev. + Audio t2v 5.8(0.1) 19.0(0.3) 28.8(0.2) 32.3(0.6) 62.45M config, model, log
Prev. + Flow t2v 6.7(0.2) 21.8(0.4) 32.5(0.5) 25.3(0.6) 80.96M config, model, log
Prev. + RGB t2v 7.5(0.1) 23.4(0.0) 34.1(0.2) 23.7(0.6) 100.26M config, model, log
Prev. + Inst t2v 9.5(0.2) 27.7(0.1) 39.4(0.1) 18.0(0.0) 119.56M config, model, log
Prev. + R2P1D t2v 9.9(0.1) 28.6(0.3) 40.7(0.1) 17.0(0.0) 137.67M config, model, log
Prev. + OCR t2v 10.0(0.1) 28.8(0.2) 40.9(0.2) 16.7(0.6) 165.33M config, model, log
Prev. + Face t2v 10.0(0.1) 29.0(0.3) 41.2(0.2) 16.0(0.0) 183.45M config, model, log
Scene v2t 5.6(0.6) 18.2(0.6) 27.7(0.3) 39.0(0.0) 19.46M config, model, log
Prev. + Speech v2t 6.0(0.2) 20.4(0.5) 30.3(1.0) 33.0(2.0) 43.94M config, model, log
Prev. + Audio v2t 8.6(0.2) 26.1(0.6) 37.8(0.8) 19.8(0.8) 62.45M config, model, log
Prev. + Flow v2t 9.9(0.4) 28.6(0.7) 41.7(0.8) 15.7(0.6) 80.96M config, model, log
Prev. + RGB v2t 11.2(0.3) 32.1(0.8) 45.4(0.6) 13.7(0.6) 100.26M config, model, log
Prev. + Inst. v2t 14.7(0.6) 38.9(0.8) 53.1(1.0) 9.3(0.6) 119.56M config, model, log
Prev. + R2P1D v2t 15.5(0.6) 40.1(1.2) 54.4(1.3) 8.7(0.6) 137.67M config, model, log
Prev. + OCR v2t 15.2(0.1) 41.1(0.6) 54.6(0.7) 8.5(0.5) 165.33M config, model, log
Prev. + Face v2t 15.6(0.3) 40.9(1.4) 55.2(1.0) 8.3(0.6) 183.45M config, model, log

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

Dimension Task R@1 R@5 R@10 MdR Params Links
384 t2v 9.4(0.2) 27.8(0.4) 39.8(0.4) 17.7(0.6) 88.62M config, model, log
512 t2v 9.8(0.3) 28.6(0.4) 40.6(0.4) 17.0(0.0) 119.51M config, model, log
640 t2v 10.1(0.1) 28.8(0.1) 40.9(0.2) 16.7(0.6) 151.12M config, model, log
768 t2v 10.0(0.1) 29.0(0.3) 41.2(0.2) 16.0(0.0) 183.45M config, model, log
1024 t2v 9.9(0.1) 28.6(0.3) 40.7(0.4) 17.0(0.0) 250.27M config, model, log
384 v2t 14.0(0.5) 38.7(0.5) 52.7(1.4) 9.3(0.6) 88.62M config, model, log
512 v2t 14.8(0.4) 40.4(0.6) 53.9(0.4) 8.8(0.3) 119.51M config, model, log
640 v2t 15.6(0.6) 41.3(0.7) 55.0(0.5) 8.3(0.6) 151.12M config, model, log
768 v2t 15.6(0.3) 40.9(1.4) 55.2(1.0) 8.3(0.6) 183.45M config, model, log
1024 v2t 14.7(0.4) 40.7(0.8) 54.4(0.3) 8.5(0.5) 250.27M config, model, log

Training with more captions: Rather than varying the number of experts, we can also investigate how performance changes as we change the number of training captions available per-video.

Experts Caps. Task R@1 R@5 R@10 MdR Params Links
RGB 1 t2v 2.6(0.1) 9.3(0.4) 15.0(0.7) 101.3(15.5) 56.7M config, model, log
RGB 20 t2v 4.9(0.1) 16.5(0.2) 25.3(0.4) 40.7(1.2) 58.05M config, model, log
All 1 t2v 4.8(0.2) 16.2(0.5) 25.0(0.7) 43.3(4.0) 183.45M config, model, log
All 20 t2v 10.0(0.1) 29.0(0.3) 41.2(0.2) 16.0(0.0) 183.45M config, model, log
RGB 1 v2t 3.7(0.3) 13.5(0.6) 20.8(0.4) 60.0(2.0) 56.7M config, model, log
RGB 20 v2t 6.9(0.6) 21.0(0.3) 31.3(0.3) 30.0(1.7) 58.05M config, model, log
All 1 v2t 8.4(0.5) 25.6(0.7) 37.1(0.2) 20.3(0.6) 183.45M config, model, log
All 20 v2t 15.6(0.3) 40.9(1.4) 55.2(1.0) 8.3(0.6) 183.45M config, model, log

Similar ablation studies for the remaining datasets can be found here.

Expert Zoo

For each dataset, the Collaborative Experts model makes use of a collection of pretrained "expert" feature extractors (see [1] for more precise descriptions). Some experts have been obtained from other sources (described where applicable), rather than extracted by us. To reproduce the experiments listed above, the experts for each dataset have been bundled into compressed tar files. These can be downloaded and unpacked with a utility script (recommended -- see example usage below), which will store them in the locations expected by the training code. Each set of experts has a brief README, which also provides a link from which they can be downloaded directly.

Dataset Experts Details and links Archive size sha1sum
MSRVTT audio, face, flow, ocr, rgb, scene, speech README 19.6 GiB 959bda588793ef05f348d16de26da84200c5a469
LSMDC audio, face, flow, ocr, rgb, scene README 6.1 GiB 7ce018e981752db9e793e449c2ba5bc88217373d
MSVD face, flow, ocr, rgb, scene README 2.1 GiB 6071827257c14de455b3a13fe1e885c2a7887c9e
DiDeMo audio, face, flow, ocr, rgb, scene, speech README 2.3 GiB 6fd4bcc68c1611052de2499fd8ab3f488c7c195b
ActivityNet audio, face, flow, ocr, rgb, scene, speech README 3.8 GiB b16685576c97cdec2783fb89ea30ca7d17abb021

QuerYD

MODEL study on QUERYD

Importance of the model:

Model Task R@1 R@5 R@10 R@50 MdR MnR Geom params Links
HowTo100m S3D t2v 13.5(0.0) 27.5(0.0) 34.5(0.0) 57.0(0.0) 35.0(0.0) 72.5(0.0) 23.4(0.0) 1 config, model, log
CE - P,CG t2v 11.6(1.3) 30.2(3.0) 43.2(3.1) 74.8(1.7) 14.2(1.6) 42.7(2.6) 24.7(1.9) 57.75M config, model, log
CE t2v 13.9(0.8) 37.6(1.2) 48.3(1.4) 78.8(0.7) 11.3(0.6) 35.1(1.6) 29.3(0.8) 30.82M config, model, log
HowTo100m S3D v2t 12.4(0.0) 23.8(0.0) 30.8(0.0) 57.0(0.0) 33.0(0.0) 73.4(0.0) 20.9(0.0) 1 config, model, log
CE - P,CG v2t 13.0(3.1) 30.9(2.0) 43.0(2.8) 73.2(0.1) 14.5(1.8) 42.6(1.5) 25.7(2.3) 57.75M config, model, log
CE v2t 13.7(0.7) 35.2(2.7) 46.9(3.2) 78.3(2.8) 12.3(1.5) 35.8(2.4) 28.3(1.5) 30.82M config, model, log

The influence of different pretrained experts on the performance of the CE model trained on QuerYD is studied. The value and cumulative effect of different experts for scene classification (SCENE), ambient sound classification (AUDIO), image classification (OBJECT), and action recognition (ACTION) are presented. PREV. denotes the experts used in the previous row.

Ablation studies on QuerYD

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the QuerYD dataset.

Experts Task R@1 R@5 R@10 R@50 MdR MnR Geom params Links
Scene t2v 8.7(0.4) 26.3(1.1) 37.1(0.7) 68.5(2.2) 22.2(1.6) 52.3(3.0) 20.4(0.1) 7.51M config, model, log
Scene + Inst. t2v 11.7(1.4) 31.6(0.9) 43.4(1.3) 74.5(0.9) 14.0(1.0) 41.1(2.1) 25.2(0.8) 17.25M config, model, log
Scene + r2p1d t2v 11.7(2.1) 32.1(3.0) 45.3(3.3) 74.6(0.4) 13.7(1.9) 42.9(2.2) 25.7(2.4) 16.07M config, model, log
Scene + Audio t2v 7.6(2.7) 27.4(1.4) 40.4(0.9) 69.1(0.9) 17.0(1.7) 49.0(1.9) 20.2(2.3) 17.25M config, model, log
Scene v2t 9.1(0.8) 25.4(0.9) 35.3(1.5) 68.2(2.2) 23.2(0.3) 52.6(2.6) 20.1(0.5) 7.51M config, model, log
Scene + Inst. v2t 11.9(0.5) 31.0(3.6) 43.5(2.7) 74.8(1.8) 14.5(0.9) 40.8(2.1) 25.2(1.1) 17.25M config, model, log
Scene + r2p1d v2t 12.7(1.4) 30.9(2.8) 44.0(1.8) 74.3(1.2) 14.3(1.2) 42.8(1.7) 25.8(1.7) 16.07M config, model, log
Scene + Audio v2t 10.1(1.2) 25.7(1.5) 37.5(1.2) 69.8(1.6) 20.0(1.3) 48.9(2.0) 21.3(1.1) 17.25M config, model, log

We can also study their cumulative effect:

Experts Task R@1 R@5 R@10 R@50 MdR MnR Geom params Links
Scene t2v 8.7(0.4) 26.3(1.1) 37.1(0.7) 68.5(2.2) 22.2(1.6) 52.3(3.0) 20.4(0.1) 7.51M config, model, log
Prev. + Audio t2v 7.6(2.7) 27.4(1.4) 40.4(0.9) 69.1(0.9) 17.0(1.7) 49.0(1.9) 20.2(2.3) 17.25M config, model, log
Prev. + Inst t2v 12.7(1.7) 34.8(1.7) 47.0(1.3) 78.0(1.0) 12.3(0.6) 37.6(2.1) 27.5(1.5) 24.63M config, model, log
Prev. + R2P1D t2v 14.3(0.3) 37.5(1.3) 48.6(0.8) 78.8(0.3) 11.3(0.6) 35.2(1.8) 29.7(0.3) 30.82M config, model, log
Scene v2t 9.1(0.8) 25.4(0.9) 35.3(1.5) 68.2(2.2) 23.2(0.3) 52.6(2.6) 20.1(0.5) 7.51M config, model, log
Prev. + Audio v2t 10.1(1.2) 25.7(1.5) 37.5(1.2) 69.8(1.6) 20.0(1.3) 48.9(2.0) 21.3(1.1) 17.25M config, model, log
Prev. + Inst. v2t 12.8(1.3) 33.5(2.8) 46.6(1.0) 76.7(1.7) 11.8(0.8) 37.6(1.9) 27.1(0.6) 24.63M config, model, log
Prev. + R2P1D v2t 14.0(0.3) 35.4(2.9) 47.2(2.8) 78.7(2.4) 12.3(1.5) 35.8(2.4) 28.6(1.2) 30.82M config, model, log

QuerYDSegments

MODEL study on QUERYDSEGMENTS

Importance of the model:

Model Task R@1 R@5 R@10 R@50 MdR MnR Geom params Links
HowTo100m S3D t2v 6.7(0.0) 14.7(0.0) 20.4(0.0) 36.6(0.0) 133.0(0.0) 342.0(0.0) 12.6(0.0) 1 config, model, log
CE - P,CG t2v 19.0(0.8) 38.9(1.0) 47.9(0.7) 68.0(0.4) 12.0(1.0) 127.4(5.9) 32.8(0.6) 57.75M config, model, log
CE t2v 18.2(0.5) 38.1(0.8) 46.8(0.4) 67.3(0.7) 13.3(0.6) 127.5(3.9) 31.9(0.4) 30.82M config, model, log
HowTo100m S3D v2t 8.4(0.0) 15.4(0.0) 19.8(0.0) 34.2(0.0) 154.5(0.0) 363.0(0.0) 13.7(0.0) 1 config, model, log
CE - P,CG v2t 19.8(0.2) 39.6(0.6) 47.6(0.1) 67.9(0.5) 13.0(0.0) 124.3(5.5) 33.4(0.2) 57.75M config, model, log
CE v2t 18.1(0.6) 37.3(0.5) 45.9(0.6) 67.2(0.2) 14.0(1.0) 123.9(3.3) 31.4(0.4) 30.82M config, model, log

Evaluating a pretrained model

Evaluating a pretrained model for a given dataset requires:

  1. The pretrained experts for the target dataset, which should be located in <root>/data/<dataset-name>/symlinked-feats (this will be done automatically by the utility script, or can be done manually).
  2. A config.json file.
  3. A trained_model.pth file.

Evaluation is then performed with the following command:

python3 test.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>

where <gpu-id> is the index of the GPU to evaluate on. This option can be omitted to run the evaluation on the CPU.

For example, to reproduce the MSVD results described above, run the following sequence of commands:

# fetch the pretrained experts for MSVD 
python3 misc/sync_experts.py --dataset MSVD

# find the name of a pretrained model using the links in the tables above 
export MODEL=data/models/msvd-train-full-ce/5bb8dda1/seed-0/2020-01-30_12-29-56/trained_model.pth

# create a local directory and download the model into it 
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/${MODEL}"

# Evaluate the model
python3 test.py --config configs/msvd/train-full-ce.json --resume ${MODEL} --device 0 --eval_from_training_config
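
When evaluating several checkpoints (for instance, one per training seed), the same invocation can be assembled from a script. The sketch below uses only the `--config`, `--resume` and `--device` options shown above. The helper names `build_eval_command` and `run_eval` are hypothetical, and the script is assumed to be launched from the repository root, where `test.py` lives.

```python
import subprocess
import sys


def build_eval_command(config, checkpoint, device=None, extra_args=()):
    """Assemble the test.py invocation described above as an argument list."""
    cmd = [
        sys.executable, "test.py",
        "--config", str(config),
        "--resume", str(checkpoint),
    ]
    if device is not None:
        # Leaving out --device runs the evaluation on the CPU.
        cmd += ["--device", str(device)]
    # Pass through any further flags, e.g. "--eval_from_training_config".
    cmd += list(extra_args)
    return cmd


def run_eval(config, checkpoint, device=None, extra_args=()):
    """Run the evaluation and raise if test.py exits with an error."""
    cmd = build_eval_command(config, checkpoint, device, extra_args)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a checkpoint path contains spaces.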

Training a new model

Training a new video-text embedding requires:

  1. The pretrained experts for the dataset used for training, which should be located in <root>/data/<dataset-name>/symlinked-feats (this will be done automatically by the utility script, or can be done manually).
  2. A config.json file. You can define your own, or use one of the provided configs in the configs directory.

Training is then performed with the following command:

python3 train.py --config <path-to-config.json> --device <gpu-id>

where <gpu-id> is the index of the GPU to train on. This option can be omitted to run the training on the CPU.

For example, to train a new embedding for the LSMDC dataset, run the following sequence of commands:

# fetch the pretrained experts for LSMDC 
python3 misc/sync_experts.py --dataset LSMDC

# Train the model
python3 train.py --config configs/lsmdc/train-full-ce.json --device 0

Visualising the retrieval ranking

Tensorboard lacks video support via HTML5 tags (at the time of writing) so after each evaluation of a retrieval model, a simple HTML file is generated to allow the predicted rankings of different videos to be visualised: an example screenshot is given below (this tool is inspired by the visualiser in the pix2pix codebase). To view the visualisation, navigate to the web directory (this is generated for each experiment, and will be printed in the log during training) and run python3 -m http.server 9999, then navigate to localhost:9999 in your web browser. You should see something like the following:

visualisation

Note that visualising the results in this manner requires that you also download the source videos for each of the datasets to some directory <src-video-dir>. Then set the visualizer.args.src_video_dir attribute of the training config.json file to point to <src-video-dir>.
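
The attribute can be edited by hand, or set programmatically as in the following sketch. It assumes that the config file is plain JSON and that the dotted path visualizer.args.src_video_dir corresponds to nested objects. The helper names are hypothetical.

```python
import json


def set_src_video_dir(config, src_video_dir):
    """Set visualizer.args.src_video_dir on a loaded config dictionary."""
    # Create the intermediate objects if the config does not define them.
    visualizer = config.setdefault("visualizer", {})
    args = visualizer.setdefault("args", {})
    args["src_video_dir"] = str(src_video_dir)
    return config


def update_config_file(config_path, src_video_dir):
    """Rewrite a config.json in place with the new source video directory."""
    with open(config_path) as f:
        config = json.load(f)
    set_src_video_dir(config, src_video_dir)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
```

Working on a copy of the config is safer than editing a provided file in place, since rewriting the file may change its key formatting.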

Dependencies

Dependencies can be installed via pip install -r requirements/pip-requirements.txt.

References

[1] If you find this code useful or use the extracted features, please consider citing:

@inproceedings{croitoru2021teachtext,
  title={Teachtext: Crossmodal generalized distillation for text-video retrieval},
  author={Croitoru, I. and Bogolin, S. and Leordeanu, M. and Jin, H. and Zisserman, A. and Albanie, S. and Liu, Y.},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={11583--11593},
  year={2021}
}

@inproceedings{Liu2019a,
  author    = {Liu, Y. and Albanie, S. and Nagrani, A. and Zisserman, A.},
  booktitle = {arXiv preprint arxiv:1907.13487},
  title     = {Use What You Have: Video retrieval using representations from collaborative experts},
  date      = {2019},
}

[2] If you make use of the MSRVTT or LSMDC features provided by Miech et al. (details are given in their respective READMEs here and here), please cite:

@article{miech2018learning,
  title={Learning a text-video embedding from incomplete and heterogeneous data},
  author={Miech, Antoine and Laptev, Ivan and Sivic, Josef},
  journal={arXiv preprint arXiv:1804.02516},
  year={2018}
}

Acknowledgements

This work was inspired by a number of prior works for learning joint embeddings of text and video, but in particular the Mixture-of-Embedding-Experts method proposed by Antoine Miech, Ivan Laptev and Josef Sivic (paper, code). We would also like to thank Zak Stone and Susie Lim for their help with using Cloud TPUs. The code structure uses the pytorch-template by @victoresque.