Closed by XFeiF 3 years ago
The authors look for a good way to incorporate temporal information into instance-discrimination-based contrastive self-supervised learning (CSL) frameworks. They present a general paradigm to enhance video CSL, named Temporal-aware Contrastive self-supervised learning (TaCo). TaCo selects a set of temporal transformations that serve not only as strong data augmentation but also as extra self-supervision for video understanding. It boosts performance through multi-task training, attaching a task head to each pretext task.
Video understanding is not a new problem, but this is the first detailed discussion of how to integrate temporal information into video CSL.
In general, this paper tries to answer three questions:
(In my opinion, the design of TaCo is not elegant. Q3 reads more like an after-the-fact explanation of the design, and I do not think TaCo can serve as a general paradigm for video CSL.)
CSL: InstDisc, MoCo, PIRL, SimCLR, etc.; InfoNCE loss. Video SSL pretext tasks: rotation, reverse, shuffle, speed, etc.
1) Temporal augmentations. 2) Additional task heads. 3) Joint multi-task training.
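The joint multi-task training in point 3 can be sketched as a contrastive loss plus a weighted sum of per-pretext-task losses. The function name, task names, and weights below are illustrative assumptions, not the authors' code:

```python
def taco_total_loss(contrastive_loss, pretext_losses, weights=None):
    """Total loss = contrastive loss + weighted sum of pretext-task losses.

    pretext_losses: dict mapping a pretext-task name (e.g. 'shuffle',
    'speed') to its scalar loss. weights: optional dict of per-task
    weights; any missing weight defaults to 1.0.
    """
    weights = weights or {}
    total = contrastive_loss
    for task, loss in pretext_losses.items():
        total += weights.get(task, 1.0) * loss
    return total


# Example with made-up loss values: 0.5 + 1.0*0.2 + 2.0*0.3 = 1.3
total = taco_total_loss(0.5, {"shuffle": 0.2, "speed": 0.3}, {"speed": 2.0})
```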
They conduct TaCo pretraining on the Kinetics-400 dataset without using labels. For fine-tuning and testing, they evaluate on both the UCF-101 and HMDB-51 datasets.
The code has not been open-sourced yet.
See question 3.
The most useful takeaway is that directly applying temporal augmentations does not help. A general direction is to find other ways of integrating temporal information into self-supervised learning (including CSL) frameworks.
Authors:
Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille
TaCo mainly comprises three modules: a temporal augmentation module, a contrastive learning module, and a temporal pretext task module. Each temporal augmentation gets its own projection head and task head. The features extracted from the projection heads of the original video sequence and its augmented sequence form a positive pair, and the remaining ones are treated as negative pairs. The contrastive loss is computed as the sum of the losses over all pairs.
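The pair construction and summed contrastive loss described above can be sketched as a batch-wise InfoNCE: row i of the original and augmented feature matrices forms a positive pair, and all other rows serve as negatives. This is a NumPy sketch under my own assumptions, not the authors' implementation:

```python
import numpy as np


def normalize(x):
    """L2-normalize each row so similarities are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def pairwise_contrastive_loss(z_orig, z_aug, temperature=0.07):
    """Sum of InfoNCE terms over all positive pairs (i, i).

    z_orig, z_aug: (batch, d) projected features of the original and
    augmented clips; row i of each forms a positive pair, and the other
    rows of z_aug act as negatives for row i of z_orig.
    """
    z_orig, z_aug = normalize(z_orig), normalize(z_aug)
    logits = (z_orig @ z_aug.T) / temperature  # (batch, batch) similarities
    # Row-wise log-softmax (shift by the max for numerical stability);
    # the diagonal holds each positive pair's similarity.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).sum()


rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
# Augmented features here are a small perturbation of the originals,
# so each positive pair is very similar and the loss should be small.
loss = pairwise_contrastive_loss(z, z + 0.01 * rng.normal(size=(4, 16)))
```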