The authors argue that good video representations should capture spatial and temporal features in a general form at multiple scales. Thus, as summarized in the title, they (a) decouple the learning objective into two contrastive subtasks that respectively emphasize spatial and temporal features, and (b) apply them hierarchically to encourage multi-scale understanding.
The problem itself is not new in video representation learning (VRL), but among contrastive-learning approaches to VRL this is the first to decouple the spatial-temporal objective into two subtasks. The proposed method of composing new temporal positive pairs is very useful (I would never have come up with it myself); see the sketch below.
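Since the decoupling is the core idea, here is a minimal sketch of what the two contrastive subtasks might look like as InfoNCE losses summed over scales. This is my own illustration, not the authors' code: the pairing scheme described in the comments and all tensor names are assumptions based on the summary above.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: pull each anchor toward its positive, push it from negatives.

    anchor, positive: (B, D) embeddings; negatives: (B, N, D).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(-1, keepdim=True)       # (B, 1) positive similarity
    neg = torch.einsum("bd,bnd->bn", anchor, negatives)   # (B, N) negative similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(len(logits), dtype=torch.long)   # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Hypothetical decoupled objective, summed over scales:
# - spatial subtask: positive = a clip from the same video at a different time,
#   so matching it relies on appearance features that persist over time;
# - temporal subtask: positive = a composed clip covering the same time span
#   (e.g., a different spatial view of it), so matching it relies on motion.
B, N, D = 8, 16, 128
loss = 0.0
for scale in range(3):  # e.g., features tapped from three network stages
    z, pos_s, pos_t = (torch.randn(B, D) for _ in range(3))   # dummy embeddings
    neg_s, neg_t = (torch.randn(B, N, D) for _ in range(2))
    loss = loss + info_nce(z, pos_s, neg_s) + info_nce(z, pos_t, neg_t)
print(loss)
```

The point of the separation is that each loss can only be solved with the "right" kind of feature: the spatial positives differ in time, the temporal positives differ in spatial view, so neither subtask lets the encoder shortcut through the other cue.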
Key related words:
unsupervised video representation learning, contrastive learning, multi-scale
SimCLR, MoCo
Pretraining on Kinetics-400, fine-tuning on UCF101 and HMDB51; ten-crop test, top-1 accuracy.
Code is not open-sourced yet.
Yes, see Question 6.
This work gives us useful augmentations and a way to decouple spatial-temporal contrast into two subtasks. These are basic 'elements' we can reuse in other work.
Paper
![](https://github.com/XFeiF/Video_PaperNotes/blob/master/imgs/HDC_example.png?raw=true)
No code available yet~
Authors:
Zehua Zhang, David Crandall (Indiana University Bloomington)