XFeiF / ComputerVision_PaperNotes

📚 Paper Notes (Computer vision)

NeurIPS 2020 | Self-supervised Co-training for Video Representation Learning #22


XFeiF commented 3 years ago

Paper & Code


XFeiF commented 3 years ago

TEN QUESTIONS

1. What is the problem addressed in the paper? Recently proposed self-supervised representation learning methods for images and videos are based on discriminating transformed versions of an instance against all other samples in the dataset. In this paper, the authors find that such instance discrimination does not make the best use of the data. The proposed CoCLR instead incorporates learning from potentially harder positives, e.g. instances of the same class, rather than only from different augmentations of the same instance.
The proposed CoCLR is a training regime based on the InfoNCE loss (a minimal sketch of InfoNCE follows).
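
For reference, a minimal InfoNCE sketch in PyTorch (my own illustration, not the authors' released code); `z1` and `z2` are assumed to be L2-normalised embeddings of two augmentations of the same batch of clips:

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Instance discrimination: the only positive for clip i is its own
    other augmentation (the diagonal); every other clip is a negative."""
    logits = z1 @ z2.t() / tau                        # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```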

2. Is this a new problem? It is not a new problem, but it is important because it exposes the fact that instance discrimination cannot make full use of the data. The authors propose a self-supervised co-training scheme that improves on the InfoNCE objective.

3. What is the scientific hypothesis that the paper is trying to verify? Does instance discrimination make the best use of the data? The answer is no.

4. What are the key related works and who are the key people working on this topic?
Key related works: InfoNCE, MoCo, SimCLR.
Key people: the authors (Tengda Han, Weidi Xie, Andrew Zisserman), Kaiming He, etc.

5. What is the key to the proposed solution in the paper? The key is how to mine positive samples. The authors propose to use other, complementary views of the data; in this paper, optical flow is used to bridge the gap between RGB video clip instances of the same class.
The idea is generally applicable to other complementary views: for videos, audio or text narrations can play a similar role to optical flow; whilst for still images, the multiple views can be formed by passing images through different filters. A rough sketch of the mining step follows.
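
A rough sketch of the co-training loss, under simplifying assumptions of my own (positives are mined within the batch, whereas the paper uses a MoCo-style memory bank; all names here are illustrative):

```python
import torch

def coclr_loss(rgb_q, rgb_k, flow_emb, k=5, tau=0.07):
    """Multi-instance NCE for the RGB network, with positives mined by the
    flow network. rgb_q/rgb_k: (N, D) embeddings of two augmentations of the
    same N clips; flow_emb: (N, D) flow embeddings of the same clips.
    All assumed L2-normalised."""
    logits = rgb_q @ rgb_k.t() / tau                  # (N, N)
    # Positive set: top-k nearest neighbours in flow space (self-similarity
    # is maximal, so the clip's own key is always among the positives).
    flow_sim = flow_emb @ flow_emb.t()                # (N, N)
    pos_mask = torch.zeros_like(logits, dtype=torch.bool)
    pos_mask.scatter_(1, flow_sim.topk(k, dim=1).indices, True)
    # Multi-instance NCE: -log of the probability mass on the positive set.
    log_prob = logits.log_softmax(dim=1)
    pos_logsum = torch.logsumexp(log_prob.masked_fill(~pos_mask, float('-inf')), dim=1)
    return -pos_logsum.mean()
```

The two networks are then trained in alternation: the flow network is held fixed while it mines positives for the RGB network, and vice versa.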

6. How are the experiments designed?

  1. InfoNCE vs. UberNCE vs. CoCLR on RGB, Flow, and their two-stream fusion (R+F);
  2. Action classification (accuracy) and action retrieval (R@1) performance on UCF101;
  3. Ablation study on the number of mined positives, top-K;
  4. Comparison with the SoTA (classification accuracy evaluation on UCF101 and HMDB51);
  5. Comparison with others on nearest-neighbour video retrieval on UCF101 and HMDB51 (the R@1 metric is sketched below).
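
For item 5, the R@1 retrieval metric is conventionally computed as below (a generic sketch assuming precomputed, L2-normalised features and integer labels; nothing here is specific to the paper's code):

```python
import torch

def recall_at_1(test_feats, test_labels, train_feats, train_labels):
    """Fraction of test clips whose nearest training clip (by cosine
    similarity) shares the same action label."""
    sim = test_feats @ train_feats.t()                # (N_test, N_train)
    nn_idx = sim.argmax(dim=1)                        # nearest-neighbour index
    return (train_labels[nn_idx] == test_labels).float().mean().item()
```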

7. What datasets are built/used for the quantitative evaluation? Is the code open-sourced?
Training: UCF101, Kinetics-400.
Evaluation: UCF101, Kinetics-400, HMDB51.
Code available.

8. Is the scientific hypothesis well supported by evidence in the experiments?
Yes. The designed UberNCE, which uses ground-truth class labels as an oracle to define the positive set, is compared with plain InfoNCE, and its clearly stronger representations demonstrate that instance discrimination is not making the best use of the data.
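
To make the comparison concrete: the three objectives share the same multi-instance form and differ only in how the positive mask is defined. A schematic paraphrase of my own, reusing the loss sketched under Q5:

```python
import torch

def positive_mask(n, labels=None, flow_sim=None, k=5):
    """InfoNCE: only the clip's own other augmentation (identity mask).
    UberNCE: clips sharing the ground-truth label (an oracle upper bound).
    CoCLR:   top-k neighbours in the complementary (flow) embedding space."""
    if labels is not None:                            # UberNCE
        return labels[:, None] == labels[None, :]
    if flow_sim is not None:                          # CoCLR
        mask = torch.zeros(n, n, dtype=torch.bool)
        return mask.scatter_(1, flow_sim.topk(k, dim=1).indices, True)
    return torch.eye(n, dtype=torch.bool)             # InfoNCE
```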

9. What are the contributions of the paper?

  1. Showing that instance discrimination is not making the best use of the data.
  2. CoCLR, a co-training regime that mines harder positives from a complementary view.
  3. SoTA or comparable performance relative to other self-supervised methods.

10. What should/could be done next?
Try to give your own answers first; the follow-up comment below offers some.

XFeiF commented 3 years ago

Q3 shows us the problem to be tackled and Q5 provides the authors' answer. So the questions are:

  1. Are there other ways to mine positive samples?
  2. Is this way of mining positive samples good enough? For example, K is a hyperparameter that must be tuned, and the alternating training process adds complexity.

It is a good source of inspiration for multi-modal self-supervised learning.