Closed XFeiF closed 3 years ago
The authors demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. In essence, they apply the core idea of the SlowFast network (a Slow pathway operating at a low frame rate to capture spatial semantics, and a Fast pathway operating at a high frame rate to capture motion at fine temporal resolution) to self-supervised contrastive learning. The backbone network takes the same instance at different tempos as input; these views should share high similarity in their discriminative semantics while remaining dissimilar to other instances. The authors add a projection layer and construct a similarity loss between the fast and slow versions of one clip, and a dissimilarity loss between the fast/slow views of one clip and the other clips in a memory bank. Besides the contrastive loss on the last layer, they apply the contrastive loss hierarchically (e.g., in ResNet-like backbones they also collect features from res4 and res5). Moreover, they propose an instance correspondence map (ICM) to visualize the core objects localized spatially and temporally by the encoders.
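The MoCo-style objective described above can be sketched roughly as follows. This is a minimal NumPy sketch under my own assumptions, not the authors' released code: `infonce_loss` and its argument names are illustrative, with the slow/fast views of a clip as the positive pair and memory-bank embeddings as negatives.

```python
import numpy as np

def infonce_loss(z_slow, z_fast, memory_bank, temperature=0.07):
    """InfoNCE between slow/fast views of the same clip (positives)
    and memory-bank embeddings of other clips (negatives).
    Shapes: z_slow, z_fast (B, D); memory_bank (K, D)."""
    def l2norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    z_s, z_f = l2norm(z_slow), l2norm(z_fast)
    bank = l2norm(memory_bank)

    pos = np.sum(z_s * z_f, axis=1, keepdims=True)   # (B, 1) positive sims
    neg = z_s @ bank.T                               # (B, K) negative sims
    logits = np.concatenate([pos, neg], axis=1) / temperature

    # Cross-entropy with the positive pair at index 0 of each row.
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()
```

The hierarchical variant would simply evaluate this loss on projected features from several backbone stages (e.g., res4 and res5) and sum the terms.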
It is not a new problem, and the method is not particularly novel, but it demonstrates the possibility of applying SlowFast-like methods to SSL. It still follows classical SSL methods (MoCo).
Visual tempo can be used as a supervision signal. This differs from directly predicting the playback pace: the authors argue that such prediction may force the learned representations to capture information that distinguishes the frequency of visual tempos, which is not necessarily related to the discriminative semantics they are looking for. However, no experiment in the paper tests this hypothesis; they simply turn to contrastive learning instead.
Contrastive learning, SSL, MoCo, SlowFast
Visual tempo (SlowFast), contrastive learning (MoCo), hierarchical contrastive loss.
Experiment settings. Main results on action recognition with comparison to prior methods. Ablation study. Evaluations on other downstream tasks. Interpretation of the learned representations via ICM.
The models are pretrained on Kinetics-400. For downstream tasks, they use UCF101 and HMDB51.
I think so.
It demonstrates the possibility of applying SlowFast-like methods to SSL.
Can we apply visual tempo without contrastive learning in SSL?
Paper.
Code (Not released until 2021-01-15)
Authors:
Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou