The authors propose a self-supervised method to learn feature representations from videos using an inter-intra contrastive framework. Different views (or other modalities, such as optical flow or frame differences) of a given video are treated as positives. Data from other videos are treated as inter-negatives. They then construct intra-negative samples by breaking the temporal relations within the anchor view, which helps the model learn temporal information.
However, they mainly evaluate their method on the retrieval task, since they designed a joint representation that allows retrieval using two different kinds of input data with only one network. (I think this is not entirely fair in general, since they feed another view, i.e., more information, to the retrieval task, even though only one model is used.)
It is not a new problem. In this paper, they make the model learn temporal information by providing temporally augmented negative samples, called intra-negative pairs. It should be noted that temporal contrastive learning is quite important for video representation learning, and the main innovation lies in how the temporal information is utilized. In self-supervised contrastive representation learning, the key is how to build temporal contrast pairs.
I think the key is how to generate useful intra-negative samples from the original video clips.
Keywords:
Self-supervised learning, video representation, video recognition, video retrieval, spatial-temporal convolution
The way of constructing positive pairs and negative pairs.
In common contrastive learning methods, the positive samples of the anchor view are simply different augmented variants of it. In this paper, the options for the different views are the original RGB clips, optical flow (u or v) frame clips, and stacked frame differences. Note that the anchor view is the RGB clip.
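As a concrete illustration of one of these views, stacked frame differences can be computed by subtracting consecutive frames of a clip. This is a minimal sketch under my own assumptions (clips as float NumPy arrays of shape `(T, H, W, C)`; the function name is hypothetical), not the authors' implementation:

```python
import numpy as np

def frame_difference(clip):
    """Residual-frame view: difference between consecutive frames.

    clip: float array of shape (T, H, W, C).
    Returns an array of shape (T-1, H, W, C).
    """
    return clip[1:] - clip[:-1]

# Example: a random 16-frame RGB clip at 112x112 resolution.
rgb = np.random.rand(16, 112, 112, 3).astype(np.float32)
diff = frame_difference(rgb)  # shape (15, 112, 112, 3)
```

Because static background pixels mostly cancel out, this view emphasizes motion, which is why it can serve as a cheap substitute for optical flow.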
The negative pairs are categorized into two types: inter-negative pairs and intra-negative pairs. The temporally augmented intra-negatives are the key to the performance boost.
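The intra-negatives keep the per-frame content of the anchor clip but break its temporal structure. A minimal sketch of two such temporal augmentations, frame shuffling and frame repetition (my own simplified versions, assuming clips as NumPy arrays of shape `(T, H, W, C)`; function names are hypothetical):

```python
import numpy as np

def shuffle_frames(clip, rng=None):
    """Intra-negative by temporal shuffling: same frames, broken order."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(clip.shape[0])
    return clip[idx]

def repeat_frame(clip, rng=None):
    """Intra-negative by frame repetition: one frame tiled over time,
    so all motion information is destroyed."""
    rng = rng or np.random.default_rng()
    i = int(rng.integers(clip.shape[0]))
    return np.repeat(clip[i:i + 1], clip.shape[0], axis=0)

clip = np.random.rand(16, 112, 112, 3)   # anchor clip (T, H, W, C)
neg_shuffle = shuffle_frames(clip)       # appearance kept, order broken
neg_repeat = repeat_frame(clip)          # appearance kept, motion removed
```

Since these negatives share appearance statistics with the anchor, the model cannot push them apart using spatial cues alone and is forced to encode temporal information.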
Ablation study.
Kinetics400, UCF101, HMDB51.
Code is open-sourced.
They utilized the advantages of intra- and inter-sample learning and trained a spatio-temporal convolutional neural network (3D CNN) with intra-negative samples within contrastive multiview coding.
To be discussed.
Paper
Code-PyTorch
Authors: Li Tao, Xueting Wang∗, Toshihiko Yamasaki