Closed by XFeiF 3 years ago
The authors look for a good way to incorporate temporal information into instance-discrimination-based contrastive self-supervised learning (CSL) frameworks. They present a general paradigm to enhance video CSL, named Temporal-aware Contrastive self-supervised learning (TaCo). TaCo selects a set of temporal transformations that serve not only as strong data augmentation but also as extra self-supervision for video understanding. It boosts performance through multi-task training, attaching a task head to each pretext task.
Video understanding is not a new problem, but this is the first detailed discussion of how to integrate temporal information into video CSL.
In general, this paper tries to answer three questions:
(In my opinion, the design of TaCo is not elegant. Q3 reads more like an after-the-fact explanation of the design, and I do not think TaCo can serve as a general paradigm for video CSL.)
CSL: InstDisc, MoCo, PIRL, SimCLR, etc.; InfoNCE loss. Video SSL pretext tasks: rotation, reverse, shuffle, speed, etc.
1) Temporal augmentations. 2) Additional task heads. 3) Joint multi-task training.
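The joint multi-task training in point 3 can be sketched as a contrastive loss plus a weighted sum of per-pretext-task losses. The function name, task names, and weights below are illustrative assumptions, not the authors' code:

```python
def taco_total_loss(contrastive_loss, pretext_losses, weights=None):
    """Total loss = contrastive loss + weighted sum of pretext-task losses.

    pretext_losses: dict mapping a pretext-task name (e.g. 'shuffle',
    'speed') to its scalar loss. weights: optional dict of per-task
    weights; any missing weight defaults to 1.0.
    """
    weights = weights or {}
    total = contrastive_loss
    for task, loss in pretext_losses.items():
        total += weights.get(task, 1.0) * loss
    return total


# Example with made-up loss values: 0.5 + 1.0*0.2 + 2.0*0.3 = 1.3
total = taco_total_loss(0.5, {"shuffle": 0.2, "speed": 0.3}, {"speed": 2.0})
```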
They conduct TaCo pretraining on the Kinetics-400 dataset without using labels. For fine-tuning and testing, they evaluate on both the UCF-101 and HMDB-51 datasets.
The code has not been open-sourced yet.
See question 3.
The most useful takeaway is that directly applying temporal augmentations does not help. A general direction is to find other ways of integrating temporal information into self-supervised learning (including CSL) frameworks.
Authors:
Yutong Bai, Haoqi Fan, Ishan Misra, Ganesh Venkatesh, Yongyi Lu, Yuyin Zhou, Qihang Yu, Vikas Chandra, Alan Yuille
TaCo mainly comprises three modules: a temporal augmentation module, a contrastive learning module, and a temporal pretext task module. Each temporal augmentation gets its own projection head and task head. The features extracted from the projection heads of the original video sequence and its augmented sequence form a positive pair, and the remaining ones are treated as negative pairs. The contrastive loss is computed as the sum of the losses over all pairs.
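The pair construction and summed contrastive loss described above can be sketched as a batch-wise InfoNCE: row i of the original and augmented feature matrices forms a positive pair, and all other rows serve as negatives. This is a NumPy sketch under my own assumptions, not the authors' implementation:

```python
import numpy as np


def normalize(x):
    """L2-normalize each row so similarities are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def pairwise_contrastive_loss(z_orig, z_aug, temperature=0.07):
    """Sum of InfoNCE terms over all positive pairs (i, i).

    z_orig, z_aug: (batch, d) projected features of the original and
    augmented clips; row i of each forms a positive pair, and the other
    rows of z_aug act as negatives for row i of z_orig.
    """
    z_orig, z_aug = normalize(z_orig), normalize(z_aug)
    logits = (z_orig @ z_aug.T) / temperature  # (batch, batch) similarities
    # Row-wise log-softmax (shift by the max for numerical stability);
    # the diagonal holds each positive pair's similarity.
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).sum()


rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
# Augmented features here are a small perturbation of the originals,
# so each positive pair is very similar and the loss should be small.
loss = pairwise_contrastive_loss(z, z + 0.01 * rng.normal(size=(4, 16)))
```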