Closed XFeiF closed 3 years ago
The authors demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. In essence, they apply the core idea of the SlowFast network (a Slow pathway operating at a low frame rate to capture spatial semantics, and a Fast pathway operating at a high frame rate to capture motion at fine temporal resolution) to self-supervised contrastive learning. The backbone network takes the same instance at different tempos as input; these views should share high similarity in their discriminative semantics while remaining dissimilar to other instances. The authors add a projection layer and construct a similarity loss between the fast and slow versions of one clip, and a dissimilarity loss between the fast/slow views of one clip and the other clips in a memory bank. Besides the contrastive loss on the last layer, they apply the contrastive loss hierarchically (e.g., in ResNet-like backbones they also collect features from res4 and res5). Moreover, they propose an instance correspondence map (ICM) to visualize the core objects localized spatially and temporally by the encoders.
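The MoCo-style objective described above can be sketched roughly as follows. This is a minimal NumPy sketch under my own assumptions, not the authors' released code: `infonce_loss` and its argument names are illustrative, with the slow/fast views of a clip as the positive pair and memory-bank embeddings as negatives.

```python
import numpy as np

def infonce_loss(z_slow, z_fast, memory_bank, temperature=0.07):
    """InfoNCE between slow/fast views of the same clip (positives)
    and memory-bank embeddings of other clips (negatives).
    Shapes: z_slow, z_fast (B, D); memory_bank (K, D)."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    z_s, z_f = l2norm(z_slow), l2norm(z_fast)
    bank = l2norm(memory_bank)

    pos = np.sum(z_s * z_f, axis=1, keepdims=True)   # (B, 1) positive sims
    neg = z_s @ bank.T                               # (B, K) negative sims
    logits = np.concatenate([pos, neg], axis=1) / temperature

    # Cross-entropy with the positive pair at index 0 of each row.
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()
```

The hierarchical variant would simply evaluate this loss on projected features from several backbone stages (e.g., res4 and res5) and sum the terms.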
It is not a new problem, and the method is not particularly novel, but it demonstrates the possibility of applying SlowFast-like methods to SSL. It still follows classical SSL methods (MoCo).
Visual tempo can be used as a supervision signal. This differs from directly predicting the playback pace: the authors argue that such prediction may force the learned representations to capture information that distinguishes the frequency of visual tempos, which is not necessarily related to the discriminative semantics they are looking for. However, no experiment in the paper tests this hypothesis; they simply turn to contrastive learning instead.
Contrastive learning, SSL, MoCo, SlowFast
Visual tempo (SlowFast), contrastive learning (MoCo), hierarchical contrastive loss.
Experiment settings. Main results on action recognition with comparison to prior methods. Ablation study. Evaluations on other downstream tasks. Interpretation of the learned representations via ICM.
The models are pretrained on Kinetics-400. For downstream tasks, they use UCF101 and HMDB51.
I think so.
It demonstrates the possibility of applying SlowFast-like methods to SSL.
Can we apply visual tempo without contrastive learning in SSL?
Paper.
Code (Not released until 2021-01-15)
Authors:
Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou