XFeiF / ComputerVision_PaperNotes

📚 Paper Notes (Computer vision)

# [20ACMMM] Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework #29


XFeiF commented 3 years ago

Paper
Code-PyTorch

Authors: Li Tao, Xueting Wang∗, Toshihiko Yamasaki

XFeiF commented 3 years ago

Ten Questions to Ask When Reading a Paper


1. What is the problem addressed in the paper?

The authors propose a self-supervised method for learning feature representations from videos using an inter-intra contrastive framework. Different views of a given video (other modalities such as optical flow or frame differences) are treated as positives, while data from other videos are treated as inter-negatives. They then construct intra-negative samples by breaking the temporal relations within the anchor view, which helps the model learn temporal information (a minimal sketch of this construction follows this answer).
However, they mainly evaluate their method on the retrieval task, for which they design a joint retrieval scheme that uses two different kinds of input data with only one network. (I think this comparison is not entirely fair, since the other view adds extra information to the retrieval task, even though only one model is used.)
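
Here is a minimal PyTorch sketch of such intra-negative generation. The paper breaks temporal relations within the anchor clip; the specific modes (shuffling, frame repeating), the `(C, T, H, W)` layout, and the function name are my assumptions for illustration, not the authors' code.

```python
import torch

def make_intra_negative(clip: torch.Tensor, mode: str = "shuffle") -> torch.Tensor:
    """Break the temporal order of a clip to create an intra-negative.

    clip: video clip of shape (C, T, H, W) -- layout is an assumption.
    """
    t = clip.shape[1]
    if mode == "shuffle":
        # Randomly permute frame order: appearance is kept,
        # but motion/temporal structure is destroyed.
        perm = torch.randperm(t)
        return clip[:, perm]
    elif mode == "repeat":
        # Repeat one randomly chosen frame T times: no motion at all.
        idx = torch.randint(t, (1,)).item()
        return clip[:, idx:idx + 1].expand(-1, t, -1, -1).contiguous()
    raise ValueError(f"unknown mode: {mode}")
```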

2. Is this a new problem?

It is not a new problem. In this paper, they make the model learn temporal information by providing temporally augmented negative samples, called intra-negative pairs. It should be noted that temporal contrast is quite important for video representation learning, and the key innovation mainly lies in how the temporal information is utilized. Here, in self-supervised contrastive representation learning, the key is how to build the temporal contrast pairs.

3. What is the scientific hypothesis that the paper is trying to verify?

I think it is that useful intra-negative samples can be generated from the original video clips, i.e., that breaking the temporal order of a clip yields negatives that force the model to learn temporal structure.

4. What are the key related works and who are the key people working on this topic?

Keywords:
Self-supervised learning, video representation, video recognition, video retrieval, spatial-temporal convolution

5. What is the key of the proposed solution in the paper?

The way positive and negative pairs are constructed.
In common contrastive learning methods, the positive samples of an anchor view are usually different augmented variants of it. In this paper, the candidate views are the original RGB clips, optical flow (u or v) frame clips, and stacked frame differences; note that the anchor view is the RGB clip.
The negative pairs are categorized into two types: inter-negative pairs (from other videos) and intra-negative pairs. The temporally augmented intra-negatives are the key to the performance boost (see the loss sketch after this answer).
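
To make the role of the two negative types concrete, here is a minimal InfoNCE-style sketch in which intra-negatives simply add extra terms to the denominator. This is a generic formulation under my own assumptions (the paper builds on contrastive multiview coding, and its exact loss may differ).

```python
import torch
import torch.nn.functional as F

def inter_intra_nce(anchor, positive, intra_neg, temperature=0.07):
    """InfoNCE with in-batch inter-negatives plus explicit intra-negatives.

    anchor, positive, intra_neg: (B, D) feature batches, where positive
    comes from another view of the same video and intra_neg from a
    temporally perturbed version of the anchor clip.
    """
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    n = F.normalize(intra_neg, dim=1)

    logits_pos = (a * p).sum(dim=1, keepdim=True)          # (B, 1)
    # Inter-negatives: every other sample in the batch (other videos).
    logits_inter = a @ p.t()                               # (B, B)
    mask = torch.eye(len(a), dtype=torch.bool, device=a.device)
    logits_inter = logits_inter.masked_fill(mask, float("-inf"))
    # Intra-negatives: same video, broken temporal order.
    logits_intra = (a * n).sum(dim=1, keepdim=True)        # (B, 1)

    logits = torch.cat([logits_pos, logits_inter, logits_intra], dim=1)
    labels = torch.zeros(len(a), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits / temperature, labels)
```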

6. How are the experiments designed?

Mainly through ablation studies (e.g., on how the intra-negatives are generated and which second view is used), followed by evaluation on the retrieval task mentioned above.

7. What datasets are built/used for the quantitative evaluation? Is the code open-sourced?

Kinetics400, UCF101, HMDB51.
Code is open-sourced.
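
As a concrete picture of the joint retrieval mentioned in question 1, here is a nearest-neighbour sketch in which features from both views are extracted by the same network and concatenated. The function name, tensor shapes, and top-k protocol are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def joint_retrieval(model, rgb_q, res_q, rgb_g, res_g, k=10):
    """Nearest-neighbour retrieval with one network and two views.

    rgb_*: RGB clips; res_*: the second view (e.g. frame differences).
    Query and gallery features are each concatenated across views.
    """
    q = torch.cat([model(rgb_q), model(res_q)], dim=1)
    g = torch.cat([model(rgb_g), model(res_g)], dim=1)
    q, g = F.normalize(q, dim=1), F.normalize(g, dim=1)
    sims = q @ g.t()                     # (num_queries, num_gallery)
    return sims.topk(k, dim=1).indices   # indices of top-k gallery clips
```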

8. Is the scientific hypothesis well supported by evidence in the experiments?

9. What are the contributions of the paper?

They utilized the advantages of intra- and inter-sample learning and trained a spatio-temporal convolutional neural network (3D CNN) with intra-negative samples in a contrastive multiview coding framework. A hypothetical end-to-end training step is sketched below.
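
Tying the earlier sketches together, a hypothetical training step might look as follows. It reuses `make_intra_negative` and `inter_intra_nce` from above; the encoder, clip shapes, and view choice are assumptions, and the authors' real training code is linked at the top of this issue.

```python
import torch

def train_step(encoder, optimizer, clips):
    """One self-supervised step with one 3D CNN and two views.

    clips: (B, C, T+1, H, W), sampled one frame longer so that both
    the RGB view and the frame-difference view have T frames.
    """
    rgb = clips[:, :, :-1]                    # anchor view: RGB, T frames
    res = clips[:, :, 1:] - clips[:, :, :-1]  # second view: frame differences
    # Intra-negatives: temporally shuffled anchor clips (sketch above).
    neg = torch.stack([make_intra_negative(c) for c in rgb])

    anchor = encoder(rgb)        # (B, D) clip-level features
    positive = encoder(res)
    intra_neg = encoder(neg)

    loss = inter_intra_nce(anchor, positive, intra_neg)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```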

10. What should/could be done next?

To be discussed.