albanie/collaborative-experts

This repo provides code for:

TeachText, which leverages complementary cues from multiple text encoders to provide an enhanced supervisory signal to the retrieval model using a generalized distillation setup (paper, project page)
Learning and evaluating joint video-text embeddings for the task of video retrieval. The approach is described in the paper "Use What You Have: Video retrieval using representations from collaborative experts" (paper, project page)
CVPR 2020 Pentathlon challenge

Requirements: The code assumes PyTorch 1.4 and Python 3.7 (other versions may work, but have not been tested). See the section on dependencies towards the end of this file for specific package requirements.

TeachText

TeachText diagram

TeachText results on MSRVTT Benchmark

Model	Split	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	Links
CE	Full	t2v	_{^11.0_(0.0)}	_{^30.8_(0.1)}	_{^43.3_(0.3)}	_{^73.1_(0.2)}	_{^15.0_(0.0)}	_{^81.8_(0.2)}	_{^24.4_(0.1)}	config_TT, model_TT, log_TT
CE+	Full	t2v	_{^13.8_(0.1)}	_{^36.5_(0.2)}	_{^49.4_(0.4)}	_{^77.6_(0.2)}	_{^11.0_(0.0)}	_{^69.4_(0.8)}	_{^29.2_(0.2)}	config_TT, model_TT, log_TT
TeachText - CE	Full	t2v	_{^11.8_(0.1)}	_{^32.7_(0.2)}	_{^45.3_(0.2)}	_{^74.9_(0.1)}	_{^13.0_(0.0)}	_{^74.9_(0.4)}	_{^25.9_(0.1)}	config_TT, model_TT, log_TT
TeachText - CE+	Full	t2v	_{^14.6_(0.0)}	_{^37.9_(0.1)}	_{^50.9_(0.2)}	_{^78.9_(0.0)}	_{^10.0_(0.0)}	_{^63.1_(0.2)}	_{^30.4_(0.0)}	config_TT, model_TT, log_TT

Please note that these numbers are higher than in the original CE results because compression artefacts have been corrected.
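The Geom column is not defined in this file. The values in the table above are consistent with the geometric mean of R@1, R@5 and R@10, so the minimal sketch below assumes that definition; the function name `geometric_mean` is illustrative and is not taken from the codebase.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values (log-domain for stability)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# R@1, R@5 and R@10 of the CE+ row in the MSRVTT table above.
ce_plus_recalls = (13.8, 36.5, 49.4)
print(round(geometric_mean(ce_plus_recalls), 1))  # 29.2, matching the Geom column
```

Because the table reports means over several runs, recomputing Geom from the rounded per-column means can differ from the listed value in the last digit.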

Denoising results on MSRVTT

Model	Split	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	Links
CE+	Full	t2v	_{^14.4_(0.1)}	_{^37.4_(0.2)}	_{^50.2_(0.1)}	_{^77.9_(0.1)}	_{^10.0_(0.0)}	_{^70.8_(0.1)}	_{^30.0_(0.1)}	config_TT, model_TT, log_TT
TeachText - CE+	Full	t2v	_{^14.9_(0.1)}	_{^38.3_(0.1)}	_{^51.5_(0.1)}	_{^79.2_(0.1)}	_{^10.0_(0.0)}	_{^62.5_(0.5)}	_{^30.9_(0.1)}	config_TT, model_TT, log_TT

TeachText results on MSVD Benchmark

Model	Split	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	Links
CE	Full	t2v	_{^21.5_(0.6)}	_{^52.3_(0.9)}	_{^67.5_(0.8)}	_{^90.7_(0.0)}	_{^5.0_(0.0)}	_{^20.4_(0.0)}	_{^42.3_(0.6)}	config_TT, model_TT, log_TT
CE+	Full	t2v	_{^25.1_(0.9)}	_{^56.5_(1.4)}	_{^70.9_(1.6)}	_{^92.4_(0.5)}	_{^4.0_(0.0)}	_{^17.8_(0.6)}	_{^46.5_(1.0)}	config_TT, model_TT, log_TT
TeachText - CE	Full	t2v	_{^22.1_(0.5)}	_{^52.2_(0.6)}	_{^67.2_(0.8)}	_{^91.2_(0.5)}	_{^5.0_(0.0)}	_{^19.6_(0.5)}	_{^42.6_(0.4)}	config_TT, model_TT, log_TT
TeachText - CE+	Full	t2v	_{^25.1_(0.6)}	_{^56.8_(0.6)}	_{^71.2_(0.6)}	_{^92.7_(0.3)}	_{^4.0_(0.0)}	_{^16.8_(0.3)}	_{^46.6_(0.5)}	config_TT, model_TT, log_TT

Denoising results on MSVD

Model	Split	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	Links
CE+	Full	t2v	_{^26.2_(0.5)}	_{^57.7_(1.0)}	_{^72.2_(1.2)}	_{^92.2_(0.4)}	_{^4.0_(0.0)}	_{^17.9_(0.5)}	_{^47.8_(0.6)}	config_TT, model_TT, log_TT
TeachText - CE+	Full	t2v	_{^25.4_(0.4)}	_{^56.9_(0.5)}	_{^71.3_(0.3)}	_{^92.8_(0.2)}	_{^4.0_(0.0)}	_{^16.7_(0.2)}	_{^46.9_(0.3)}	config_TT, model_TT, log_TT

TeachText results on DiDeMo Benchmark

Model	Split	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	Links
CE	Full	t2v	_{^17.1_(0.9)}	_{^41.9_(0.2)}	_{^56.0_(0.5)}	_{^83.4_(0.9)}	_{^8.0_(0.0)}	_{^42.8_(2.8)}	_{^34.2_(0.4)}	config_TT, model_TT, log_TT
CE+	Full	t2v	_{^18.2_(0.3)}	_{^43.9_(1.1)}	_{^57.1_(0.9)}	_{^84.0_(1.6)}	_{^7.9_(0.1)}	_{^38.5_(3.4)}	_{^35.8_(0.4)}	config_TT, model_TT, log_TT
TeachText - CE	Full	t2v	_{^21.0_(0.7)}	_{^47.5_(1.1)}	_{^61.9_(0.6)}	_{^86.4_(1.0)}	_{^6.0_(0.0)}	_{^35.1_(1.0)}	_{^39.5_(0.5)}	config_TT, model_TT, log_TT
TeachText - CE+	Full	t2v	_{^21.6_(0.8)}	_{^48.6_(0.5)}	_{^62.9_(0.7)}	_{^86.8_(0.3)}	_{^6.0_(0.0)}	_{^31.5_(0.8)}	_{^40.4_(0.4)}	config_TT, model_TT, log_TT

TeachText results on LSMDC Benchmark

Model	Split	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	Links
CE	Full	t2v	_{^12.4_(0.7)}	_{^28.5_(0.8)}	_{^37.9_(0.6)}	_{^64.5_(0.8)}	_{^21.7_(0.6)}	_{^88.0_(4.8)}	_{^23.7_(0.3)}	config_TT, model_TT, log_TT
CE+	Full	t2v	_{^14.9_(0.7)}	_{^33.7_(0.2)}	_{^44.1_(0.7)}	_{^67.3_(0.8)}	_{^15.3_(0.6)}	_{^77.8_(6.7)}	_{^28.1_(0.3)}	config_TT, model_TT, log_TT
TeachText - CE	Full	t2v	_{^13.7_(0.9)}	_{^30.2_(0.4)}	_{^40.1_(0.4)}	_{^66.0_(0.6)}	_{^19.8_(1.3)}	_{^84.0_(1.8)}	_{^25.5_(0.5)}	config_TT, model_TT, log_TT
TeachText - CE+	Full	t2v	_{^17.2_(0.5)}	_{^36.5_(0.7)}	_{^46.3_(0.4)}	_{^68.8_(0.4)}	_{^13.7_(0.6)}	_{^72.3_(0.1)}	_{^30.7_(0.3)}	config_TT, model_TT, log_TT

TeachText results on Activity-Net Benchmark

Model	Split	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	Links
CE	Full	t2v	_{^19.9_(0.4)}	_{^50.1_(0.8)}	_{^66.1_(0.6)}	_{^92.2_(0.7)}	_{^5.3_(0.6)}	_{^21.3_(1.1)}	_{^40.4_(0.3)}	config_TT, model_TT, log_TT
CE+	Full	t2v	_{^19.4_(0.2)}	_{^49.3_(0.5)}	_{^65.4_(0.4)}	_{^92.1_(0.2)}	_{^6.0_(0.0)}	_{^22.5_(0.4)}	_{^39.7_(0.0)}	config_TT, model_TT, log_TT
TeachText - CE	Full	t2v	_{^22.7_(0.8)}	_{^56.2_(0.1)}	_{^71.6_(0.8)}	_{^95.3_(0.1)}	_{^4.0_(0.0)}	_{^15.8_(0.1)}	_{^45.0_(0.6)}	config_TT, model_TT, log_TT
TeachText - CE+	Full	t2v	_{^23.5_(0.2)}	_{^57.2_(0.6)}	_{^73.6_(0.2)}	_{^96.1_(0.1)}	_{^4.0_(0.0)}	_{^13.7_(0.1)}	_{^46.3_(0.2)}	config_TT, model_TT, log_TT

You can download the high quality features used for TeachText from:

For MSRVTT:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-MSRVTT-experts.tar.gz
sha1sum: 734650c3b98509996da75cdedc12101836624917

For MSVD:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-MSVD-experts.tar.gz
sha1sum: c8eba8c5291dd6bb501757ed0cc327cd22217965

For DiDeMo:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-DiDeMo-experts.tar.gz
sha1sum: 8e128309f12cf3260fe538f82578b5ad91a46bd0

For ActivityNet:
http://www.robots.ox.ac.uk/~vgg/research/teachtext/data-hq/high-quality/high-quality-activity-net-experts.tar.gz
sha1sum: 2f3c7c2fe86bd6d0c6230464a940c429291a4012
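After downloading one of the archives above, its integrity can be checked against the listed sha1sum. The following is a minimal sketch using only the Python standard library; the helper names `sha1_of_file` and `matches_checksum` and the local file name in the usage comment are illustrative assumptions, not part of this codebase.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_sha1):
    """Compare a file's SHA-1 digest with a published checksum (case-insensitive)."""
    return sha1_of_file(path) == expected_sha1.strip().lower()


# Usage (hypothetical local path; the checksum is the MSVD value listed above):
# matches_checksum("high-quality-MSVD-experts.tar.gz",
#                  "c8eba8c5291dd6bb501757ed0cc327cd22217965")
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte feature archives.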

Collaborative Experts

CE diagram

High-level Overview: The Collaborative Experts framework aims to achieve robustness through two mechanisms:

The use of information from a wide range of modalities, including those that are typically always available in video (such as RGB) as well as more "specific" clues which may only occasionally be present (such as overlaid text).
A module that aims to combine these modalities into a fixed-size representation in a manner that is robust to noise.

Important: A note on the updated results: A previous version of the codebase (and paper) reported results on the retrieval benchmarks that were affected by a significant software bug, leading to an overestimate of performance. We are extremely grateful to Valentin Gabeur, who discovered this bug (it has been corrected in the current codebase).

CVPR 2020: Pentathlon challenge

We are hosting a video retrieval challenge as part of the Video Pentathlon Workshop. Find out how to participate here!

Pretrained video embeddings

We provide pretrained models for each dataset to reproduce the results reported in the paper [1] (references follow at the end of this README). Each model is accompanied by training and evaluation logs. Performance is evaluated for retrieval in both directions (joint embeddings can be used for either of these two tasks):

t2v denotes that a text query is used to retrieve videos
v2t denotes that a video query is used to retrieve text descriptions of videos

In the results reported below, the same model is used for both the t2v and v2t evaluations. Each metric is reported as the mean and standard deviation (in parentheses) across three training runs.
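The metric columns in the tables (R@K, MdR, MnR) and the "mean (standard deviation)" format can be illustrated with a small sketch. It makes several assumptions: queries are the rows of a similarity matrix, the correct candidate for query i sits at column i, and ties are resolved in favour of the correct candidate. The function names are illustrative, and this is not the evaluation code used in this repo. The standard deviation is taken as the sample standard deviation, which is consistent with entries such as 8.3 (0.6) arising from median ranks of 8, 8 and 9.

```python
import statistics


def ranks_from_similarity(sim):
    """Rank (1 = best) of the correct candidate for each query.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is at index i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Count candidates that score strictly higher than the correct one.
        better = sum(1 for score in row if score > row[i])
        ranks.append(better + 1)
    return ranks


def retrieval_metrics(ranks, ks=(1, 5, 10, 50)):
    """Recall at K (in percent), median rank and mean rank."""
    metrics = {"R@%d" % k: 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MdR"] = statistics.median(ranks)
    metrics["MnR"] = statistics.mean(ranks)
    return metrics


def summarise_runs(values):
    """Mean and sample standard deviation of one metric across training runs."""
    return statistics.mean(values), statistics.stdev(values)


# Toy t2v example: 3 text queries (rows) scored against 3 videos (columns).
sim = [[0.9, 0.1, 0.2],
       [0.8, 0.3, 0.1],
       [0.2, 0.4, 0.6]]
print(retrieval_metrics(ranks_from_similarity(sim), ks=(1, 5)))
print(summarise_runs([8, 8, 9]))
```

For v2t, the same functions apply to the transposed matrix, with videos as queries and text descriptions as candidates.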

MSRVTT Benchmark

Model	Split	Task	R@1	R@5	R@10	R@50	MdR	MnR	Links
CE	Full	t2v	_{^10.0_(0.1)}	_{^29.0_(0.3)}	_{^41.2_(0.2)}	_{^71.4_(0.1)}	_{^16.0_(0.0)}	_{^86.8_(0.3)}	config, model, log
CE	1k-A	t2v	_{^20.9_(1.2)}	_{^48.8_(0.6)}	_{^62.4_(0.8)}	_{^89.1_(0.4)}	_{^6.0_(0.0)}	_{^28.2_(0.8)}	config, model, log
CE	1k-B	t2v	_{^18.2_(0.7)}	_{^46.0_(0.4)}	_{^60.7_(0.2)}	_{^86.6_(0.5)}	_{^7.0_(0.0)}	_{^35.3_(1.1)}	config, model, log
MoEE*	1k-B	t2v	_{^15.0_(0.7)}	_{^39.7_(1.0)}	_{^54.5_(1.1)}	_{^82.7_(0.6)}	_{^8.3_(0.6)}	_{^43.7_(0.7)}	config, model, log
CE	Full	v2t	_{^15.6_(0.3)}	_{^40.9_(1.4)}	_{^55.2_(1.0)}	_{^84.0_(0.1)}	_{^8.3_(0.6)}	_{^38.1_(1.8)}	config, model, log
CE	1k-A	v2t	_{^20.6_(0.6)}	_{^50.3_(0.5)}	_{^64.0_(0.2)}	_{^89.9_(0.3)}	_{^5.3_(0.6)}	_{^25.1_(0.8)}	config, model, log
CE	1k-B	v2t	_{^18.0_(0.8)}	_{^46.0_(0.5)}	_{^60.3_(0.5)}	_{^86.4_(0.3)}	_{^6.5_(0.5)}	_{^30.6_(1.2)}	config, model, log
MoEE*	1k-B	v2t	_{^14.5_(0.8)}	_{^40.4_(0.8)}	_{^54.9_(1.0)}	_{^83.8_(0.5)}	_{^8.8_(0.4)}	_{^38.7_(0.9)}	config, model, log

Models marked with * use the features made available with the MoEE model of [2] (without OCR, speech and scene features). Unstarred models on the 1k-B and Full splits make use of OCR, speech and scene features, as well as slightly stronger text encodings (GPT rather than word2vec; see [1] for details). The MoEE model is implemented as a sanity check that our codebase approximately reproduces [2] (the MoEE paper).

See the MSRVTT README for links to the train/val/test lists of each split.

MSVD Benchmark

Model	Task	R@1	R@5	R@10	R@50	MdR	MnR	Links
CE	t2v	_{^19.8_(0.3)}	_{^49.0_(0.3)}	_{^63.8_(0.1)}	_{^89.0_(0.2)}	_{^6.0_(0.0)}	_{^23.1_(0.3)}	config, model, log
CE	v2t	_{^23.9_(1.4)}	_{^50.2_(0.8)}	_{^59.6_(1.2)}	_{^82.3_(0.7)}	_{^5.6_(0.5)}	_{^41.2_(3.4)}	config, model, log

See the MSVD README for descriptions of the train/test splits. Note that the videos in the MSVD dataset do not have soundtracks.

DiDeMo Benchmark

Model	Task	R@1	R@5	R@10	R@50	MdR	MnR	Links
CE	t2v	_{^16.1_(1.4)}	_{^41.1_(0.4)}	_{^54.4_(0.8)}	_{^82.7_(0.3)}	_{^8.3_(0.6)}	_{^43.7_(3.6)}	config, model, log
CE	v2t	_{^15.6_(1.3)}	_{^40.9_(0.4)}	_{^55.2_(0.5)}	_{^82.2_(1.3)}	_{^8.2_(0.3)}	_{^42.4_(3.3)}	config, model, log

See the DiDeMo README for descriptions of the train/val/test splits.

ActivityNet Benchmark

Model	Task	R@1	R@5	R@10	R@50	MdR	MnR	Links
CE	t2v	_{^18.2_(0.3)}	_{^47.7_(0.6)}	_{^63.9_(0.5)}	_{^91.4_(0.4)}	_{^6.0_(0.0)}	_{^23.1_(0.5)}	config, model, log
CE	v2t	_{^17.7_(0.6)}	_{^46.6_(0.7)}	_{^62.8_(0.4)}	_{^90.9_(0.2)}	_{^6.0_(0.0)}	_{^24.4_(0.5)}	config, model, log

See the ActivityNet README for descriptions of the train/test splits.

LSMDC Benchmark

Model	Task	R@1	R@5	R@10	R@50	MdR	MnR	Links
CE	t2v	_{^11.2_(0.4)}	_{^26.9_(1.1)}	_{^34.8_(2.0)}	_{^62.1_(1.5)}	_{^25.3_(3.1)}	_{^96.8_(5.0)}	config, model, log
CE	v2t	_{^11.7_(0.5)}	_{^25.8_(1.5)}	_{^34.4_(1.7)}	_{^61.4_(0.7)}	_{^28.0_(2.6)}	_{^97.6_(2.8)}	config, model, log

See the LSMDC README for descriptions of the train/test splits. Please note that to obtain the features and descriptions for this dataset, you must obtain permission from MPII to use the data (this process is described here). Once you have done so, please request that a member of the LSMDC team contact us to confirm approval (via albanie at robots dot ox dot ac dot uk); we can then provide you with a link to the features.

Ablation studies on MSRVTT

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the MSRVTT dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

Model	Task	R@1	R@5	R@10	MdR	Params	Links
Concat	t2v	_{^0.0_(0.0)}	_{^0.0_(0.0)}	_{^0.0_(0.0)}	_{^1495.5_(0.0)}	369.72k	config, model, log
CE - MW,P,CG	t2v	_{^8.5_(0.1)}	_{^25.9_(0.3)}	_{^37.6_(0.2)}	_{^19.0_(0.0)}	246.22M	config, model, log
CE - P,CG	t2v	_{^9.6_(0.1)}	_{^28.0_(0.2)}	_{^39.7_(0.2)}	_{^17.7_(0.6)}	400.41M	config, model, log
CE - CG	t2v	_{^9.7_(0.1)}	_{^28.1_(0.2)}	_{^40.2_(0.1)}	_{^17.0_(0.0)}	181.07M	config, model, log
CE	t2v	_{^10.0_(0.1)}	_{^29.0_(0.3)}	_{^41.2_(0.2)}	_{^16.0_(0.0)}	183.45M	config, model, log
Concat	v2t	_{^0.0_(0.0)}	_{^0.0_(0.0)}	_{^0.0_(0.0)}	_{^29897.5_(0.0)}	369.72k	config, model, log
CE - MW,P,CG	v2t	_{^13.7_(0.4)}	_{^38.8_(1.2)}	_{^53.1_(1.1)}	_{^9.2_(0.8)}	246.22M	config, model, log
CE - P,CG	v2t	_{^14.1_(0.2)}	_{^39.5_(1.0)}	_{^53.2_(0.3)}	_{^9.0_(0.0)}	400.41M	config, model, log
CE - CG	v2t	_{^15.1_(0.3)}	_{^40.3_(0.5)}	_{^54.3_(0.7)}	_{^8.8_(0.3)}	181.07M	config, model, log
CE	v2t	_{^15.6_(0.3)}	_{^40.9_(1.4)}	_{^55.2_(1.0)}	_{^8.3_(0.6)}	183.45M	config, model, log

Each row adds an additional component to the model. The names refer to the following model designs:

Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses a slightly greater number of VLAD clusters than the others to allow the concatenated embedding to match the dimensionality of the text.
CE - MW,P,CG - The CE model without MoE weights, projecting to a common dimension or Collaborative Gating.
CE - P,CG - The CE model without projecting to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
CE - CG - The CE model without Collaborative Gating (CG).
CE - The full CE model.

Note that in the table above some metrics have been removed to allow the number of parameters to be displayed; these additional metrics can be found in the linked logs.
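To make the "MoE weights" component named in the list above more concrete, the sketch below follows the general mixture-of-embedding-experts idea from [2]: a similarity is computed per expert, and the per-expert similarities are combined with softmax-normalised weights, renormalised over the experts that are available for a given video. This is a simplified illustration under those assumptions, not the model code in this repo. The function names and the `weight_logits` input (which a trained model would predict from the text) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_similarity(text_embs, video_embs, weight_logits):
    """Combine per-expert similarities with softmax-normalised weights.

    text_embs and video_embs map expert name -> embedding; weight_logits maps
    expert name -> unnormalised weight (hypothetical input). Experts missing
    from video_embs are skipped and the weights renormalised over the rest.
    """
    names = sorted(video_embs)
    weights = softmax([weight_logits[name] for name in names])
    return sum(w * cosine(text_embs[name], video_embs[name])
               for w, name in zip(weights, names))


text = {"rgb": [1.0, 0.0], "audio": [0.0, 1.0]}
logits = {"rgb": 0.0, "audio": 0.0}
# Both experts present: rgb matches (1.0), audio does not (0.0), equal weights.
print(weighted_similarity(text, {"rgb": [1.0, 0.0], "audio": [1.0, 0.0]}, logits))
# Audio missing for this video: all of the weight falls on rgb.
print(weighted_similarity(text, {"rgb": [1.0, 0.0]}, logits))
```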

Importance of Different Experts: The next ablation investigates the value of each of the different experts towards the final embedding. Since not all experts are available in every video, we pair each expert with scene features, to give an approximation of their usefulness.

Experts	Task	R@1	R@5	R@10	MdR	Params	Links
Scene	t2v	_{^4.0_(0.1)}	_{^14.1_(0.1)}	_{^22.4_(0.3)}	_{^50.0_(1.0)}	19.46M	config, model, log
Scene + Inst.	t2v	_{^7.2_(0.1)}	_{^22.3_(0.3)}	_{^33.0_(0.2)}	_{^25.3_(0.6)}	41.12M	config, model, log
Scene + r2p1d	t2v	_{^6.8_(0.1)}	_{^21.7_(0.1)}	_{^32.4_(0.1)}	_{^25.7_(0.6)}	39.95M	config, model, log
Scene + RGB	t2v	_{^5.0_(0.2)}	_{^16.6_(0.7)}	_{^25.5_(1.0)}	_{^40.7_(2.1)}	41.12M	config, model, log
Scene + Flow	t2v	_{^5.3_(0.3)}	_{^17.6_(0.8)}	_{^27.1_(0.9)}	_{^36.0_(1.7)}	40.34M	config, model, log
Scene + Audio	t2v	_{^5.6_(0.0)}	_{^18.7_(0.1)}	_{^28.2_(0.1)}	_{^33.7_(0.6)}	40.34M	config, model, log
Scene + OCR	t2v	_{^4.1_(0.1)}	_{^14.1_(0.1)}	_{^22.2_(0.2)}	_{^50.3_(1.2)}	49.49M	config, model, log
Scene + Speech	t2v	_{^4.6_(0.1)}	_{^15.5_(0.2)}	_{^24.4_(0.2)}	_{^44.7_(1.2)}	43.94M	config, model, log
Scene + Face	t2v	_{^4.1_(0.1)}	_{^14.2_(0.3)}	_{^22.4_(0.4)}	_{^49.7_(0.6)}	39.95M	config, model, log
Scene	v2t	_{^5.6_(0.6)}	_{^18.2_(0.6)}	_{^27.7_(0.3)}	_{^39.0_(0.0)}	19.46M	config, model, log
Scene + Inst.	v2t	_{^10.1_(0.3)}	_{^29.7_(0.5)}	_{^41.9_(0.7)}	_{^15.2_(0.9)}	41.12M	config, model, log
Scene + r2p1d	v2t	_{^9.4_(0.3)}	_{^27.8_(0.6)}	_{^40.1_(1.1)}	_{^17.2_(1.1)}	39.95M	config, model, log
Scene + RGB	v2t	_{^6.9_(0.5)}	_{^21.2_(0.9)}	_{^31.1_(1.9)}	_{^28.7_(3.8)}	41.12M	config, model, log
Scene + Flow	v2t	_{^7.3_(0.6)}	_{^22.3_(1.4)}	_{^33.4_(1.7)}	_{^25.2_(2.0)}	40.34M	config, model, log
Scene + Audio	v2t	_{^8.2_(0.4)}	_{^24.8_(0.4)}	_{^36.0_(0.1)}	_{^21.7_(0.6)}	40.34M	config, model, log
Scene + OCR	v2t	_{^5.4_(0.5)}	_{^18.6_(1.2)}	_{^26.6_(1.2)}	_{^40.0_(1.0)}	49.49M	config, model, log
Scene + Speech	v2t	_{^6.0_(0.2)}	_{^20.4_(0.5)}	_{^30.3_(1.0)}	_{^33.0_(2.0)}	43.94M	config, model, log
Scene + Face	v2t	_{^5.6_(1.0)}	_{^17.9_(0.7)}	_{^26.7_(0.8)}	_{^39.1_(2.6)}	39.95M	config, model, log

We can also study their cumulative effect:

Experts	Task	R@1	R@5	R@10	MdR	Params	Links
Scene	t2v	_{^4.0_(0.1)}	_{^14.1_(0.1)}	_{^22.4_(0.3)}	_{^50.0_(1.0)}	19.46M	config, model, log
Prev. + Speech	t2v	_{^4.6_(0.1)}	_{^15.5_(0.2)}	_{^24.4_(0.2)}	_{^44.7_(1.2)}	43.94M	config, model, log
Prev. + Audio	t2v	_{^5.8_(0.1)}	_{^19.0_(0.3)}	_{^28.8_(0.2)}	_{^32.3_(0.6)}	62.45M	config, model, log
Prev. + Flow	t2v	_{^6.7_(0.2)}	_{^21.8_(0.4)}	_{^32.5_(0.5)}	_{^25.3_(0.6)}	80.96M	config, model, log
Prev. + RGB	t2v	_{^7.5_(0.1)}	_{^23.4_(0.0)}	_{^34.1_(0.2)}	_{^23.7_(0.6)}	100.26M	config, model, log
Prev. + Inst	t2v	_{^9.5_(0.2)}	_{^27.7_(0.1)}	_{^39.4_(0.1)}	_{^18.0_(0.0)}	119.56M	config, model, log
Prev. + R2P1D	t2v	_{^9.9_(0.1)}	_{^28.6_(0.3)}	_{^40.7_(0.1)}	_{^17.0_(0.0)}	137.67M	config, model, log
Prev. + OCR	t2v	_{^10.0_(0.1)}	_{^28.8_(0.2)}	_{^40.9_(0.2)}	_{^16.7_(0.6)}	165.33M	config, model, log
Prev. + Face	t2v	_{^10.0_(0.1)}	_{^29.0_(0.3)}	_{^41.2_(0.2)}	_{^16.0_(0.0)}	183.45M	config, model, log
Scene	v2t	_{^5.6_(0.6)}	_{^18.2_(0.6)}	_{^27.7_(0.3)}	_{^39.0_(0.0)}	19.46M	config, model, log
Prev. + Speech	v2t	_{^6.0_(0.2)}	_{^20.4_(0.5)}	_{^30.3_(1.0)}	_{^33.0_(2.0)}	43.94M	config, model, log
Prev. + Audio	v2t	_{^8.6_(0.2)}	_{^26.1_(0.6)}	_{^37.8_(0.8)}	_{^19.8_(0.8)}	62.45M	config, model, log
Prev. + Flow	v2t	_{^9.9_(0.4)}	_{^28.6_(0.7)}	_{^41.7_(0.8)}	_{^15.7_(0.6)}	80.96M	config, model, log
Prev. + RGB	v2t	_{^11.2_(0.3)}	_{^32.1_(0.8)}	_{^45.4_(0.6)}	_{^13.7_(0.6)}	100.26M	config, model, log
Prev. + Inst.	v2t	_{^14.7_(0.6)}	_{^38.9_(0.8)}	_{^53.1_(1.0)}	_{^9.3_(0.6)}	119.56M	config, model, log
Prev. + R2P1D	v2t	_{^15.5_(0.6)}	_{^40.1_(1.2)}	_{^54.4_(1.3)}	_{^8.7_(0.6)}	137.67M	config, model, log
Prev. + OCR	v2t	_{^15.2_(0.1)}	_{^41.1_(0.6)}	_{^54.6_(0.7)}	_{^8.5_(0.5)}	165.33M	config, model, log
Prev. + Face	v2t	_{^15.6_(0.3)}	_{^40.9_(1.4)}	_{^55.2_(1.0)}	_{^8.3_(0.6)}	183.45M	config, model, log

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

Dimension	Task	R@1	R@5	R@10	MdR	Params	Links
384	t2v	_{^9.4_(0.2)}	_{^27.8_(0.4)}	_{^39.8_(0.4)}	_{^17.7_(0.6)}	88.62M	config, model, log
512	t2v	_{^9.8_(0.3)}	_{^28.6_(0.4)}	_{^40.6_(0.4)}	_{^17.0_(0.0)}	119.51M	config, model, log
640	t2v	_{^10.1_(0.1)}	_{^28.8_(0.1)}	_{^40.9_(0.2)}	_{^16.7_(0.6)}	151.12M	config, model, log
768	t2v	_{^10.0_(0.1)}	_{^29.0_(0.3)}	_{^41.2_(0.2)}	_{^16.0_(0.0)}	183.45M	config, model, log
1024	t2v	_{^9.9_(0.1)}	_{^28.6_(0.3)}	_{^40.7_(0.4)}	_{^17.0_(0.0)}	250.27M	config, model, log
384	v2t	_{^14.0_(0.5)}	_{^38.7_(0.5)}	_{^52.7_(1.4)}	_{^9.3_(0.6)}	88.62M	config, model, log
512	v2t	_{^14.8_(0.4)}	_{^40.4_(0.6)}	_{^53.9_(0.4)}	_{^8.8_(0.3)}	119.51M	config, model, log
640	v2t	_{^15.6_(0.6)}	_{^41.3_(0.7)}	_{^55.0_(0.5)}	_{^8.3_(0.6)}	151.12M	config, model, log
768	v2t	_{^15.6_(0.3)}	_{^40.9_(1.4)}	_{^55.2_(1.0)}	_{^8.3_(0.6)}	183.45M	config, model, log
1024	v2t	_{^14.7_(0.4)}	_{^40.7_(0.8)}	_{^54.4_(0.3)}	_{^8.5_(0.5)}	250.27M	config, model, log

Training with more captions: Rather than varying the number of experts, we can also investigate how performance changes as we change the number of training captions available per video.

Experts	Caps.	Task	R@1	R@5	R@10	MdR	Params	Links
RGB	1	t2v	_{^2.6_(0.1)}	_{^9.3_(0.4)}	_{^15.0_(0.7)}	_{^101.3_(15.5)}	56.7M	config, model, log
RGB	20	t2v	_{^4.9_(0.1)}	_{^16.5_(0.2)}	_{^25.3_(0.4)}	_{^40.7_(1.2)}	58.05M	config, model, log
All	1	t2v	_{^4.8_(0.2)}	_{^16.2_(0.5)}	_{^25.0_(0.7)}	_{^43.3_(4.0)}	183.45M	config, model, log
All	20	t2v	_{^10.0_(0.1)}	_{^29.0_(0.3)}	_{^41.2_(0.2)}	_{^16.0_(0.0)}	183.45M	config, model, log
RGB	1	v2t	_{^3.7_(0.3)}	_{^13.5_(0.6)}	_{^20.8_(0.4)}	_{^60.0_(2.0)}	56.7M	config, model, log
RGB	20	v2t	_{^6.9_(0.6)}	_{^21.0_(0.3)}	_{^31.3_(0.3)}	_{^30.0_(1.7)}	58.05M	config, model, log
All	1	v2t	_{^8.4_(0.5)}	_{^25.6_(0.7)}	_{^37.1_(0.2)}	_{^20.3_(0.6)}	183.45M	config, model, log
All	20	v2t	_{^15.6_(0.3)}	_{^40.9_(1.4)}	_{^55.2_(1.0)}	_{^8.3_(0.6)}	183.45M	config, model, log

Similar ablation studies for the remaining datasets can be found here.

Expert Zoo

For each dataset, the Collaborative Experts model makes use of a collection of pretrained "expert" feature extractors (see [1] for more precise descriptions). Some experts have been obtained from other sources (described where applicable), rather than extracted by us. To reproduce the experiments listed above, the experts for each dataset have been bundled into compressed tar files. These can be downloaded and unpacked with a utility script (recommended -- see example usage below), which will store them in the locations expected by the training code. Each set of experts has a brief README, which also provides a link from which they can be downloaded directly.

Dataset	Experts	Details and links	Archive size	sha1sum
MSRVTT	audio, face, flow, ocr, rgb, scene, speech	README	19.6 GiB	959bda588793ef05f348d16de26da84200c5a469
LSMDC	audio, face, flow, ocr, rgb, scene	README	6.1 GiB	7ce018e981752db9e793e449c2ba5bc88217373d
MSVD	face, flow, ocr, rgb, scene	README	2.1 GiB	6071827257c14de455b3a13fe1e885c2a7887c9e
DiDeMo	audio, face, flow, ocr, rgb, scene, speech	README	2.3 GiB	6fd4bcc68c1611052de2499fd8ab3f488c7c195b
ActivityNet	audio, face, flow, ocr, rgb, scene, speech	README	3.8 GiB	b16685576c97cdec2783fb89ea30ca7d17abb021

QuerYD

MODEL study on QUERYD

Importance of the model:

Model	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	params	Links
HowTo100m S3D	t2v	_{^13.5_(0.0)}	_{^27.5_(0.0)}	_{^34.5_(0.0)}	_{^57.0_(0.0)}	_{^35.0_(0.0)}	_{^72.5_(0.0)}	_{^23.4_(0.0)}	1	config, model, log
CE - P,CG	t2v	_{^11.6_(1.3)}	_{^30.2_(3.0)}	_{^43.2_(3.1)}	_{^74.8_(1.7)}	_{^14.2_(1.6)}	_{^42.7_(2.6)}	_{^24.7_(1.9)}	57.75M	config, model, log
CE	t2v	_{^13.9_(0.8)}	_{^37.6_(1.2)}	_{^48.3_(1.4)}	_{^78.8_(0.7)}	_{^11.3_(0.6)}	_{^35.1_(1.6)}	_{^29.3_(0.8)}	30.82M	config, model, log
HowTo100m S3D	v2t	_{^12.4_(0.0)}	_{^23.8_(0.0)}	_{^30.8_(0.0)}	_{^57.0_(0.0)}	_{^33.0_(0.0)}	_{^73.4_(0.0)}	_{^20.9_(0.0)}	1	config, model, log
CE - P,CG	v2t	_{^13.0_(3.1)}	_{^30.9_(2.0)}	_{^43.0_(2.8)}	_{^73.2_(0.1)}	_{^14.5_(1.8)}	_{^42.6_(1.5)}	_{^25.7_(2.3)}	57.75M	config, model, log
CE	v2t	_{^13.7_(0.7)}	_{^35.2_(2.7)}	_{^46.9_(3.2)}	_{^78.3_(2.8)}	_{^12.3_(1.5)}	_{^35.8_(2.4)}	_{^28.3_(1.5)}	30.82M	config, model, log

The influence of different pretrained experts on the performance of the CE model trained on QuerYD is studied. The value and cumulative effect of different experts for scene classification (SCENE), ambient sound classification (AUDIO), image classification (OBJECT), and action recognition (ACTION) are presented. PREV. denotes the experts used in the previous row.

Ablation studies on QuerYD

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the QuerYD dataset.

Experts	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	params	Links
Scene	t2v	_{^8.7_(0.4)}	_{^26.3_(1.1)}	_{^37.1_(0.7)}	_{^68.5_(2.2)}	_{^22.2_(1.6)}	_{^52.3_(3.0)}	_{^20.4_(0.1)}	7.51M	config, model, log
Scene + Inst.	t2v	_{^11.7_(1.4)}	_{^31.6_(0.9)}	_{^43.4_(1.3)}	_{^74.5_(0.9)}	_{^14.0_(1.0)}	_{^41.1_(2.1)}	_{^25.2_(0.8)}	17.25M	config, model, log
Scene + r2p1d	t2v	_{^11.7_(2.1)}	_{^32.1_(3.0)}	_{^45.3_(3.3)}	_{^74.6_(0.4)}	_{^13.7_(1.9)}	_{^42.9_(2.2)}	_{^25.7_(2.4)}	16.07M	config, model, log
Scene + Audio	t2v	_{^7.6_(2.7)}	_{^27.4_(1.4)}	_{^40.4_(0.9)}	_{^69.1_(0.9)}	_{^17.0_(1.7)}	_{^49.0_(1.9)}	_{^20.2_(2.3)}	17.25M	config, model, log
Scene	v2t	_{^9.1_(0.8)}	_{^25.4_(0.9)}	_{^35.3_(1.5)}	_{^68.2_(2.2)}	_{^23.2_(0.3)}	_{^52.6_(2.6)}	_{^20.1_(0.5)}	7.51M	config, model, log
Scene + Inst.	v2t	_{^11.9_(0.5)}	_{^31.0_(3.6)}	_{^43.5_(2.7)}	_{^74.8_(1.8)}	_{^14.5_(0.9)}	_{^40.8_(2.1)}	_{^25.2_(1.1)}	17.25M	config, model, log
Scene + r2p1d	v2t	_{^12.7_(1.4)}	_{^30.9_(2.8)}	_{^44.0_(1.8)}	_{^74.3_(1.2)}	_{^14.3_(1.2)}	_{^42.8_(1.7)}	_{^25.8_(1.7)}	16.07M	config, model, log
Scene + Audio	v2t	_{^10.1_(1.2)}	_{^25.7_(1.5)}	_{^37.5_(1.2)}	_{^69.8_(1.6)}	_{^20.0_(1.3)}	_{^48.9_(2.0)}	_{^21.3_(1.1)}	17.25M	config, model, log

We can also study their cumulative effect:

Experts	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	params	Links
Scene	t2v	_{^8.7_(0.4)}	_{^26.3_(1.1)}	_{^37.1_(0.7)}	_{^68.5_(2.2)}	_{^22.2_(1.6)}	_{^52.3_(3.0)}	_{^20.4_(0.1)}	7.51M	config, model, log
Prev. + Audio	t2v	_{^7.6_(2.7)}	_{^27.4_(1.4)}	_{^40.4_(0.9)}	_{^69.1_(0.9)}	_{^17.0_(1.7)}	_{^49.0_(1.9)}	_{^20.2_(2.3)}	17.25M	config, model, log
Prev. + Inst	t2v	_{^12.7_(1.7)}	_{^34.8_(1.7)}	_{^47.0_(1.3)}	_{^78.0_(1.0)}	_{^12.3_(0.6)}	_{^37.6_(2.1)}	_{^27.5_(1.5)}	24.63M	config, model, log
Prev. + R2P1D	t2v	_{^14.3_(0.3)}	_{^37.5_(1.3)}	_{^48.6_(0.8)}	_{^78.8_(0.3)}	_{^11.3_(0.6)}	_{^35.2_(1.8)}	_{^29.7_(0.3)}	30.82M	config, model, log
Scene	v2t	_{^9.1_(0.8)}	_{^25.4_(0.9)}	_{^35.3_(1.5)}	_{^68.2_(2.2)}	_{^23.2_(0.3)}	_{^52.6_(2.6)}	_{^20.1_(0.5)}	7.51M	config, model, log
Prev. + Audio	v2t	_{^10.1_(1.2)}	_{^25.7_(1.5)}	_{^37.5_(1.2)}	_{^69.8_(1.6)}	_{^20.0_(1.3)}	_{^48.9_(2.0)}	_{^21.3_(1.1)}	17.25M	config, model, log
Prev. + Inst.	v2t	_{^12.8_(1.3)}	_{^33.5_(2.8)}	_{^46.6_(1.0)}	_{^76.7_(1.7)}	_{^11.8_(0.8)}	_{^37.6_(1.9)}	_{^27.1_(0.6)}	24.63M	config, model, log
Prev. + R2P1D	v2t	_{^14.0_(0.3)}	_{^35.4_(2.9)}	_{^47.2_(2.8)}	_{^78.7_(2.4)}	_{^12.3_(1.5)}	_{^35.8_(2.4)}	_{^28.6_(1.2)}	30.82M	config, model, log

QuerYDSegments

MODEL study on QUERYDSEGMENTS

Importance of the model:

Model	Task	R@1	R@5	R@10	R@50	MdR	MnR	Geom	params	Links
HowTo100m S3D	t2v	_{^6.7_(0.0)}	_{^14.7_(0.0)}	_{^20.4_(0.0)}	_{^36.6_(0.0)}	_{^133.0_(0.0)}	_{^342.0_(0.0)}	_{^12.6_(0.0)}	1	config, model, log
CE - P,CG	t2v	_{^19.0_(0.8)}	_{^38.9_(1.0)}	_{^47.9_(0.7)}	_{^68.0_(0.4)}	_{^12.0_(1.0)}	_{^127.4_(5.9)}	_{^32.8_(0.6)}	57.75M	config, model, log
CE	t2v	_{^18.2_(0.5)}	_{^38.1_(0.8)}	_{^46.8_(0.4)}	_{^67.3_(0.7)}	_{^13.3_(0.6)}	_{^127.5_(3.9)}	_{^31.9_(0.4)}	30.82M	config, model, log
HowTo100m S3D	v2t	_{^8.4_(0.0)}	_{^15.4_(0.0)}	_{^19.8_(0.0)}	_{^34.2_(0.0)}	_{^154.5_(0.0)}	_{^363.0_(0.0)}	_{^13.7_(0.0)}	1	config, model, log
CE - P,CG	v2t	_{^19.8_(0.2)}	_{^39.6_(0.6)}	_{^47.6_(0.1)}	_{^67.9_(0.5)}	_{^13.0_(0.0)}	_{^124.3_(5.5)}	_{^33.4_(0.2)}	57.75M	config, model, log
CE	v2t	_{^18.1_(0.6)}	_{^37.3_(0.5)}	_{^45.9_(0.6)}	_{^67.2_(0.2)}	_{^14.0_(1.0)}	_{^123.9_(3.3)}	_{^31.4_(0.4)}	30.82M	config, model, log

Evaluating a pretrained model

Evaluating a pretrained model for a given dataset requires:

The pretrained experts for the target dataset, which should be located in <root>/data/<dataset-name>/symlinked-feats (this will be done automatically by the utility script, or can be done manually).
A config.json file.
A trained_model.pth file.

Evaluation is then performed with the following command:

python3 test.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>

where <gpu-id> is the index of the GPU to evaluate on. This option can be omitted to run the evaluation on the CPU.
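When evaluating many checkpoints, it can help to assemble the command above programmatically and to fail early if a required file is missing. The following is a minimal sketch: the helper `build_eval_command` is hypothetical and only uses the `--config`, `--resume` and `--device` flags described above.

```python
import os


def build_eval_command(config_path, checkpoint_path, device=None):
    """Assemble the test.py invocation described above as an argument list.

    device is the GPU index; leave it as None to evaluate on the CPU.
    """
    for path in (config_path, checkpoint_path):
        if not os.path.isfile(path):
            raise FileNotFoundError(path)
    command = ["python3", "test.py", "--config", config_path, "--resume", checkpoint_path]
    if device is not None:
        command += ["--device", str(device)]
    return command


# Usage (run from the repository root, with paths that exist locally):
# import subprocess
# subprocess.run(build_eval_command("config.json", "trained_model.pth", device=0), check=True)
```

Passing the command as a list avoids shell quoting problems with long checkpoint paths.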

For example, to reproduce the MSVD results described above, run the following sequence of commands:

# fetch the pretrained experts for MSVD 
python3 misc/sync_experts.py --dataset MSVD

# find the name of a pretrained model using the links in the tables above 
export MODEL=data/models/msvd-train-full-ce/5bb8dda1/seed-0/2020-01-30_12-29-56/trained_model.pth

# create a local directory and download the model into it 
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/${MODEL}"

# Evaluate the model
python3 test.py --config configs/msvd/train-full-ce.json --resume ${MODEL} --device 0 --eval_from_training_config

Training a new model

Training a new video-text embedding requires:

The pretrained experts for the dataset used for training, which should be located in <root>/data/<dataset-name>/symlinked-feats (this will be done automatically by the utility script, or can be done manually).
A config.json file. You can define your own, or use one of the provided configs in the configs directory.

Training is then performed with the following command:

python3 train.py --config <path-to-config.json> --device <gpu-id>

where <gpu-id> is the index of the GPU to train on. This option can be omitted to run the training on the CPU.

For example, to train a new embedding for the LSMDC dataset, run the following sequence of commands:

# fetch the pretrained experts for LSMDC 
python3 misc/sync_experts.py --dataset LSMDC

# Train the model
python3 train.py --config configs/lsmdc/train-full-ce.json --device 0

Visualising the retrieval ranking

Tensorboard lacks video support via HTML5 tags (at the time of writing) so after each evaluation of a retrieval model, a simple HTML file is generated to allow the predicted rankings of different videos to be visualised: an example screenshot is given below (this tool is inspired by the visualiser in the pix2pix codebase). To view the visualisation, navigate to the web directory (this is generated for each experiment, and will be printed in the log during training) and run python3 -m http.server 9999, then navigate to localhost:9999 in your web browser. You should see something like the following:

visualisation

Note that visualising the results in this manner requires that you also download the source videos for each of the datasets to some directory <src-video-dir>. Then set the visualizer.args.src_video_dir attribute of the training config.json file to point to <src-video-dir>.
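Setting the visualizer.args.src_video_dir attribute can be done by hand or with a few lines of Python. The sketch below assumes that the config is plain JSON made of nested objects and writes the result to a separate file rather than editing in place. The helper names and the example paths in the usage comment are hypothetical.

```python
import json


def set_nested(config, dotted_key, value):
    """Set config["a"]["b"]["c"] from the dotted key "a.b.c", creating dicts as needed."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


def point_visualiser_at_videos(config_path, src_video_dir, out_path):
    """Write a copy of a training config whose visualiser reads videos from src_video_dir."""
    with open(config_path) as handle:
        config = json.load(handle)
    set_nested(config, "visualizer.args.src_video_dir", src_video_dir)
    with open(out_path, "w") as handle:
        json.dump(config, handle, indent=4)


# Usage (hypothetical output path and video directory):
# point_visualiser_at_videos("configs/msvd/train-full-ce.json",
#                            "/path/to/msvd/videos",
#                            "configs/msvd/train-full-ce-vis.json")
```

The new file can then be passed to train.py via --config in place of the original.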

Dependencies

Dependencies can be installed via pip install -r requirements/pip-requirements.txt.

References

[1] If you find this code useful or use the extracted features, please consider citing:

@inproceedings{croitoru2021teachtext,
  title={Teachtext: Crossmodal generalized distillation for text-video retrieval},
  author={Croitoru, I. and Bogolin, S. and Leordeanu, M. and Jin, H. and Zisserman, A. and Albanie, S. and Liu, Y.},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={11583--11593},
  year={2021}
}

@inproceedings{Liu2019a,
  author    = {Liu, Y. and Albanie, S. and Nagrani, A. and Zisserman, A.},
  booktitle = {arXiv preprint arxiv:1907.13487},
  title     = {Use What You Have: Video retrieval using representations from collaborative experts},
  date      = {2019},
}

[2] If you make use of the MSRVTT or LSMDC features provided by Miech et al. (details are given in their respective READMEs here and here), please cite:

@article{miech2018learning,
  title={Learning a text-video embedding from incomplete and heterogeneous data},
  author={Miech, Antoine and Laptev, Ivan and Sivic, Josef},
  journal={arXiv preprint arXiv:1804.02516},
  year={2018}
}

Acknowledgements

This work was inspired by a number of prior works for learning joint embeddings of text and video, but in particular the Mixture-of-Embedding-Experts method proposed by Antoine Miech, Ivan Laptev and Josef Sivic (paper, code). We would also like to thank Zak Stone and Susie Lim for their help with using Cloud TPUs. The code structure uses the pytorch-template by @victoresque.