The authors argue that good video representations should capture spatial and temporal features in a general form at multiple scales. Thus, as summarized in the title, they (a) decouple the learning objective into two contrastive subtasks that respectively emphasize spatial and temporal features, and (b) apply them hierarchically to encourage multi-scale understanding.
The problem itself is not new in video representation learning (VRL), but among contrastive-learning approaches to VRL this is the first to decouple the spatial-temporal objective into two subtasks. The proposed method of composing new temporal positive pairs is very useful (I would never have come up with it myself); see the sketch below.
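Since the decoupling is the core idea, here is a minimal sketch of what the two contrastive subtasks might look like as InfoNCE losses summed over scales. This is my own illustration, not the authors' code: the pairing scheme described in the comments and all tensor names are assumptions based on the summary above.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: pull each anchor toward its positive, push it from negatives.

    anchor, positive: (B, D) embeddings; negatives: (B, N, D).
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(-1, keepdim=True)       # (B, 1) positive similarity
    neg = torch.einsum("bd,bnd->bn", anchor, negatives)   # (B, N) negative similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(len(logits), dtype=torch.long)   # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Hypothetical decoupled objective, summed over scales:
# - spatial subtask: positive = a clip from the same video at a different time,
#   so matching it relies on appearance features that persist over time;
# - temporal subtask: positive = a composed clip covering the same time span
#   (e.g., a different spatial view of it), so matching it relies on motion.
B, N, D = 8, 16, 128
loss = 0.0
for scale in range(3):  # e.g., features tapped from three network stages
    z, pos_s, pos_t = (torch.randn(B, D) for _ in range(3))   # dummy embeddings
    neg_s, neg_t = (torch.randn(B, N, D) for _ in range(2))
    loss = loss + info_nce(z, pos_s, neg_s) + info_nce(z, pos_t, neg_t)
print(loss)
```

The point of the separation is that each loss can only be solved with the "right" kind of feature: the spatial positives differ in time, the temporal positives differ in spatial view, so neither subtask lets the encoder shortcut through the other cue.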
Key related words:
unsupervised video representation learning, contrastive learning, multi-scale
SimCLR, MoCo
Pretraining on Kinetics-400, fine-tuning on UCF101 and HMDB51; ten-crop test, top-1 accuracy.
Code is not open-sourced yet.
Yes, see Question 6.
This work gives us useful augmentations and a way to decouple spatial-temporal contrast into two subtasks. These are basic 'elements' we can reuse in other work.
Paper
![](https://github.com/XFeiF/Video_PaperNotes/blob/master/imgs/HDC_example.png?raw=true)
No code available yet~
Authors:
Zehua Zhang, David Crandall (Indiana University Bloomington)