XFeiF / ComputerVision_PaperNotes

📚 Paper Notes (Computer vision)

20 | Can Temporal Information Help with Contrastive Self-Supervised Learning? #26

Closed XFeiF closed 3 years ago

XFeiF commented 3 years ago

Paper (no code is available yet).

Authors:
Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille

Overview of the proposed temporal-aware contrastive self-supervised learning framework (TaCo).
TaCo mainly comprises three modules: a temporal augmentation module, a contrastive learning module, and a temporal pretext task module. For each temporal augmentation, they apply a separate projection head and task head. The features extracted from the projection heads of the original video sequence and its augmented counterpart are treated as a positive pair, and the remaining ones are regarded as negative pairs. The contrastive loss is computed as the sum of the losses over all pairs.
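
As a reference for myself, a minimal PyTorch sketch of the InfoNCE-style objective described above. The function name and the batch-wise treatment of negatives are my own simplification, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_orig, z_aug, temperature=0.1):
    """InfoNCE loss over a batch of embedding pairs.

    z_orig, z_aug: (N, D) projections of the original clips and their
    temporally augmented versions; row i of each tensor forms a
    positive pair, every other row serves as a negative.
    """
    z_orig = F.normalize(z_orig, dim=1)
    z_aug = F.normalize(z_aug, dim=1)
    logits = z_orig @ z_aug.t() / temperature  # (N, N) similarity matrix
    labels = torch.arange(z_orig.size(0), device=z_orig.device)
    # The diagonal entries are the positives; cross-entropy sums the
    # per-pair losses over the whole batch.
    return F.cross_entropy(logits, labels)
```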

XFeiF commented 3 years ago

Ten Questions to Ask When Reading a Paper

1. What is the problem addressed in the paper?

The authors try to find a good way to incorporate temporal information into the instance-discrimination-based contrastive self-supervised learning (CSL) framework. They present a general paradigm to enhance video CSL, named Temporal-aware Contrastive self-supervised learning (TaCo). TaCo selects a set of temporal transformations that serve not only as strong data augmentations but also as extra self-supervision for video understanding. It boosts performance through multi-task training, with a dedicated task head for every pretext task.

2. Is this a new problem?

For video understanding, it is not a new problem. But this is the first detailed discussion of how to integrate temporal information into video CSL.

3. What is the scientific hypothesis that the paper is trying to verify?

In general, this paper tries to answer three questions:

(In my opinion, TaCo's design is not elegant. Q3 reads more like an after-the-fact justification of the design, and I do not think TaCo can serve as a general paradigm for video CSL.)

4. What are the key related works and who are the key people working on this topic?

CSL: InstDisc, MoCo, PIRL, SimCLR, etc., all built around the InfoNCE loss. Video SSL pretext tasks: rotation, reverse, shuffle, speed, etc.

5. What is the key to the solution proposed in the paper?

1) Temporal augmentations. 2) An additional task head for each augmentation. 3) Joint multi-task training.
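
A hedged sketch of how these three pieces could combine into one joint objective. The class name, the weight `lam`, and the list-based task interface are my assumptions, and `info_nce_loss` is the sketch from the first comment, not the paper's code:

```python
import torch.nn as nn

class TaCoStyleLoss(nn.Module):
    """Joint objective: one contrastive term plus one pretext-task
    term per temporal transformation (e.g. speed or shuffle
    classification), weighted by `lam`."""

    def __init__(self, lam=1.0):
        super().__init__()
        self.lam = lam
        self.task_criterion = nn.CrossEntropyLoss()

    def forward(self, z_orig, z_aug, task_logits, task_labels):
        # Contrastive term between original and augmented projections.
        loss = info_nce_loss(z_orig, z_aug)
        # One classification loss per temporal pretext task head.
        for logits, labels in zip(task_logits, task_labels):
            loss = loss + self.lam * self.task_criterion(logits, labels)
        return loss
```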

6. How are the experiments designed?

7. What datasets are built/used for the quantitative evaluation? Is the code open-sourced?

They conduct TaCo pretraining on the Kinetics-400 dataset without using labels. For fine-tuning and testing, they evaluate on both UCF-101 and HMDB-51.
The code is not open-sourced yet.

8. Is the scientific hypothesis well supported by evidence in the experiments?

See question 3.

9. What are the contributions of the paper?

  1. Directly applying temporal augmentation shows limited improvement, or is even detrimental.
  2. TaCo enables effective integration of temporal information by selecting temporal transformations not only as strong augmentations but also as extra self-supervision under the CSL paradigm (examples of such transformations are sketched below).
  3. TaCo accommodates various temporal transformations, backbones, and CSL approaches well.
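
To make point 2 concrete, a few typical temporal transformations written as a sketch; the function names and defaults are illustrative, not the paper's implementation:

```python
import torch

def reverse(clip):
    # clip: (C, T, H, W); play the clip backwards along time.
    return clip.flip(dims=[1])

def speed_up(clip, rate=2):
    # Keep every `rate`-th frame to simulate faster playback.
    return clip[:, ::rate]

def shuffle_segments(clip, n_segments=4):
    # Split the clip along time into segments and permute their order.
    T = clip.size(1)
    seg = T // n_segments
    order = torch.randperm(n_segments).tolist()
    parts = [clip[:, i * seg:(i + 1) * seg] for i in order]
    return torch.cat(parts, dim=1)
```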

10. What should/could be done next?

The most useful finding is that directly applying temporal augmentations does not help. A natural next step is to find other ways of integrating temporal information into self-supervised learning (including CSL) frameworks.