1. What is the problem addressed in the paper?
Recently proposed self-supervised representation learning methods for images and videos are based on instance discrimination: a transformed version of a sample is discriminated against the other samples in the dataset. In this paper, the authors find that instance discrimination does not make the best use of the data. The proposed CoCLR instead learns from potentially harder positives, e.g. instances of the same class, rather than only from different augmentations of the same instance.
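As a rough illustration of the instance-discrimination objective described above, here is a minimal NumPy sketch of an InfoNCE loss in which the only positive for each sample is its own second view. The function name `info_nce`, the temperature value, and the random stand-in embeddings are assumptions made for illustration; this is not the authors' code.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE with one positive per query: keys[i] is the positive for queries[i],
    and every other row of `keys` acts as a negative."""
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature                         # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                     # positives sit on the diagonal

# Toy data (hypothetical): view_b stands in for a second augmentation of view_a.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
view_b = view_a + 0.05 * rng.normal(size=(8, 16))

aligned_loss = info_nce(view_a, view_b)          # positives correctly paired
shuffled_loss = info_nce(view_a, view_b[::-1])   # positives deliberately mismatched
```

The loss is low when each query is closest to its own second view and high when the pairing is broken. Other instances of the same class are always pushed away as negatives, which is the limitation the paper targets.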
The proposed CoCLR (training regime based on InfoNCE loss):
2. Is this a new problem? It is not a new problem, but it is important because it reveals that instance discrimination cannot make full use of the data. The authors propose a self-supervised co-training scheme that improves on the InfoNCE loss.
3. What is the scientific hypothesis that the paper is trying to verify? Is instance discrimination making the best use of data? The answer is no.
4. What are the key related works and who are the key people working on this topic?
Key related works: InfoNCE, MoCo, SimCLR.
Key people: the authors (Tengda Han, Weidi Xie, Andrew Zisserman), Kaiming He, etc.
5. What is the key to the proposed solution in the paper?
The key is how to mine positive samples. The authors propose using other, complementary views of the data. In this paper, they use optical flow to bridge the gap between RGB video clip instances of the same class.
The idea is generally applicable to other complementary views: for videos, audio or text narrations can play a similar role to optical flow; for still images, the multiple views can be formed by passing images through different filters.
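The positive-mining idea can be sketched as follows: for each clip, take its nearest neighbours in the optical-flow embedding space and treat them as extra positives when computing the contrastive loss on RGB embeddings. This is a minimal NumPy sketch under stated assumptions: `mine_positives`, `multi_positive_nce`, the value of `k`, and the toy embeddings are all hypothetical, and the full co-training scheme is not reproduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def mine_positives(flow_emb, k=2):
    """Boolean (N, N) mask: True at (i, j) if j is i itself or one of
    i's top-k nearest neighbours in the flow embedding space."""
    f = l2_normalize(flow_emb)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)               # do not count self as a neighbour
    topk = np.argsort(-sim, axis=1)[:, :k]       # indices of the k most similar clips
    mask = np.eye(len(f), dtype=bool)            # the clip's own view stays a positive
    rows = np.arange(len(f))[:, None]
    mask[rows, topk] = True
    return mask

def multi_positive_nce(rgb_q, rgb_k, pos_mask, temperature=0.07):
    """Contrastive loss whose numerator sums over all positives marked in pos_mask."""
    q, k = l2_normalize(rgb_q), l2_normalize(rgb_k)
    logits = q @ k.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    pos = (exp * pos_mask).sum(axis=1)
    return -np.mean(np.log(pos / exp.sum(axis=1)))

# Toy data (hypothetical): flow embeddings cluster tightly by class.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])
centers = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
flow_emb = centers[labels] + 0.01 * rng.normal(size=(6, 4))
rgb_q = rng.normal(size=(6, 8))
rgb_k = rgb_q + 0.1 * rng.normal(size=(6, 8))

pos_mask = mine_positives(flow_emb, k=2)
loss_instance = multi_positive_nce(rgb_q, rgb_k, np.eye(6, dtype=bool))
loss_mined = multi_positive_nce(rgb_q, rgb_k, pos_mask)
```

In this toy setup the mined neighbours coincide with the same-class clips, even though no labels are passed to `mine_positives`. With an identity mask the loss reduces to the single-positive instance-discrimination case.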
6. How are the experiments designed?
7. What datasets are built/used for the quantitative evaluation? Is the code open-sourced?
Training: UCF101, Kinetics-400.
Evaluation: UCF101, Kinetics-400, HMDB51.
Code available.
8. Is the scientific hypothesis well supported by evidence in the experiments?
UberNCE is compared with plain InfoNCE, and the results demonstrate that instance discrimination does not make the best use of the data.
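To make the comparison concrete: as I understand the paper, UberNCE defines positives using class labels, while InfoNCE treats only the sample's own augmented view as positive. The sketch below contrasts the two positive sets on a toy batch. The helper names and the label values are hypothetical and chosen for illustration only.

```python
import numpy as np

def instance_positive_mask(n):
    """InfoNCE-style positives: each sample is positive only with its own view."""
    return np.eye(n, dtype=bool)

def class_positive_mask(labels):
    """UberNCE-style positives (assumed form): samples sharing a class label."""
    labels = np.asarray(labels)
    return labels[:, None] == labels[None, :]

# Toy batch (hypothetical): integer class ids for five clips.
labels = [0, 0, 1, 0, 1]

inst_mask = instance_positive_mask(len(labels))
cls_mask = class_positive_mask(labels)

# Number of positives available to each sample under the two definitions.
positives_instance = inst_mask.sum(axis=1)
positives_class = cls_mask.sum(axis=1)
```

The class-based mask always contains the instance-based one, so the label-based objective sees strictly more positives whenever a class appears more than once in the batch.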
9. What are the contributions of the paper?
10. What should/could be done next?
Give your answers?
Q3 shows us the problem to be tackled and Q5 provides the authors' answer. So the questions are:
It is a good inspiration for multi-modal self-supervised learning.
Paper & Code