XFeiF / ComputerVision_PaperNotes

📚 Paper Notes (Computer vision)

20 | Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning #28

Closed XFeiF closed 3 years ago

XFeiF commented 3 years ago

Paper
No code available now~
Authors:
Zehua Zhang, David Crandall (Indiana University Bloomington)

XFeiF commented 3 years ago

Ten Questions to Ask When Reading a Paper


1. What is the problem addressed in the paper?

The authors argue that good video representations should capture spatial and temporal features in a general form at multiple scales. Thus, as summarized in the title, they (a) decouple the learning objective into two contrastive subtasks that respectively emphasize spatial and temporal features, and (b) apply this hierarchically to encourage multi-scale understanding.
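To make the decoupling concrete, here is a minimal sketch (not the paper's actual implementation) of an objective with two InfoNCE terms: one pulling together spatially-augmented views, the other temporally-augmented views. The embeddings and augmentation choices below are hypothetical stand-ins.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: -log softmax of the positive's similarity
    against the positive plus all negatives."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

rng = np.random.default_rng(0)
z_anchor = rng.normal(size=128)
# Hypothetical embeddings: a spatially augmented view (e.g. crop / color jitter)
# and a temporally augmented view (e.g. a clip from a different time span).
z_spatial_pos = z_anchor + 0.1 * rng.normal(size=128)
z_temporal_pos = z_anchor + 0.1 * rng.normal(size=128)
negatives = [rng.normal(size=128) for _ in range(8)]

# Decoupled objective: the two subtasks contribute separate contrastive terms.
loss = (info_nce(z_anchor, z_spatial_pos, negatives) +
        info_nce(z_anchor, z_temporal_pos, negatives))
```

In the paper the two terms are further applied at multiple levels of the backbone; this sketch only shows the single-scale decoupling.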

2. Is this a new problem?

It is not a new problem in video representation learning (VRL). But among applications of contrastive learning to VRL, it is the first to decouple the spatial-temporal objective into two subtasks. The proposed method of composing new temporal positive pairs is very useful (I had not thought of it before).
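A rough sketch of how temporal positive pairs can be composed: sample two clips from the same video at different start times and treat them as positives. This is an illustrative helper, not necessarily the paper's exact sampling scheme.

```python
import numpy as np

def sample_temporal_pair(video, clip_len, rng):
    """Sample two clips of length clip_len from the same video at
    (possibly) different start times; treating them as a positive pair
    encourages the encoder to capture video-level temporal structure.
    Hypothetical helper for illustration only."""
    n = len(video)
    assert n >= clip_len, "video too short for a clip"
    s1 = int(rng.integers(0, n - clip_len + 1))
    s2 = int(rng.integers(0, n - clip_len + 1))
    return video[s1:s1 + clip_len], video[s2:s2 + clip_len]

rng = np.random.default_rng(0)
video = np.arange(64)  # stand-in for a 64-frame video
clip_a, clip_b = sample_temporal_pair(video, clip_len=16, rng=rng)
```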

3. What is the scientific hypothesis that the paper is trying to verify?

5. What is the key of the proposed solution in the paper?

6. How are the experiments designed?

7. What datasets are built/used for the quantitative evaluation? Is the code open-sourced?

Pretraining on Kinetics-400, finetuning on UCF101 and HMDB51, evaluated with ten-crop top-1 accuracy.
The code is not open-sourced yet.

8. Is the scientific hypothesis well supported by evidence in the experiments?

Yes, see Question 6.

9. What are the contributions of the paper?

10. What should/could be done next?

This work gives us information about useful augmentations and a way to decouple spatial-temporal contrast into two subtasks. These are basic 'elements' we can reuse in other works.