21AAAI | Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

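The starting point of this paper is that models focusing on holistic spatiotemporal features "take shortcuts" on many classes: they classify the motion using only the scene information that accompanies it. To obtain better features, the scene and the motion need to be decoupled. The key parts are the disturbance methods designed separately for the spatial and temporal dimensions, and the construction of positives and negatives: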

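Spatial Local Disturbance: this transformation breaks the scene information so that the model cannot take the shortcut, while preserving the motion information. Using the result as the positive (motion-untouched, scene-broken) forces the model to learn motion. Concretely, it warps the spatial context with Thin-Plate-Spline (TPS), a method commonly used in OCR to rectify distorted text regions.
Temporal Local Disturbance: the opposite of the above. This transformation breaks the temporal information while keeping the scene information, and is used to build the negative (motion-broken, scene-untouched) samples. Concretely, it has two steps: (1) optical-flow scaling, which speeds up or slows down the motion without drastically changing the background pixels; (2) temporal shift, used to distinguish different videos that contain the same view. It is similar in spirit to CVRL: the start frames of the two views differ by a random offset.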
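The temporal shift in step (2) can be illustrated with a minimal sketch. It covers only the sampling of two clips whose start frames differ by a random offset; optical-flow scaling and the TPS warp are not shown. The function name `sample_shifted_clips` and the parameters `clip_len` and `max_shift` are hypothetical, chosen for illustration and not taken from the paper.

```python
import numpy as np


def sample_shifted_clips(video, clip_len, max_shift, rng):
    """Sample two clips from one video whose start frames differ by a random offset.

    video:     array of shape (T, H, W, C)
    clip_len:  number of frames per clip (assumed parameter)
    max_shift: largest allowed offset between the two start frames (assumed parameter)
    rng:       a numpy.random.Generator
    """
    num_frames = video.shape[0]
    if max_shift < 1:
        raise ValueError("max_shift must be at least 1")
    if num_frames < clip_len + 1:
        raise ValueError("video is too short for a shifted pair")

    # Clamp the shift so that both clips still fit inside the video.
    max_shift = min(max_shift, num_frames - clip_len)
    shift = int(rng.integers(1, max_shift + 1))

    # Pick the first start frame so that the shifted clip also fits.
    start_a = int(rng.integers(0, num_frames - clip_len - shift + 1))
    start_b = start_a + shift

    clip_a = video[start_a:start_a + clip_len]
    clip_b = video[start_b:start_b + clip_len]
    return clip_a, clip_b, shift


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Dummy video: 32 frames of 4x4 RGB, each frame filled with its own index.
    video = np.broadcast_to(np.arange(32).reshape(32, 1, 1, 1), (32, 4, 4, 3))
    clip_a, clip_b, shift = sample_shifted_clips(video, clip_len=8, max_shift=4, rng=rng)
    print(clip_a.shape, clip_b.shape, shift)
```

Both clips come from the same video and therefore share the scene, while their temporal content is offset, which matches the "scene-untouched" side of the negative construction described above.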

Loss construction:

Triplet Loss: the positive and the negative are constructed as described above;
Contrastive Learning Loss: the positive sample is the motion-untouched, scene-broken clip, and raw clips from other videos serve as negatives.
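The two losses can be sketched in NumPy on precomputed clip embeddings. This is a rough illustration under stated assumptions, not the paper's implementation: the squared Euclidean distance, the `margin` value, the InfoNCE form of the contrastive loss, and the `temperature` value are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Normalize each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge triplet loss on squared Euclidean distances (assumed distance and margin).

    anchor:   embeddings of raw clips, shape (B, D)
    positive: embeddings of motion-untouched, scene-broken clips, shape (B, D)
    negative: embeddings of motion-broken, scene-untouched clips, shape (B, D)
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def contrastive_loss(anchor, positive, temperature=0.1):
    """InfoNCE-style loss (assumed form): raw clips of other videos act as negatives."""
    a = l2_normalize(anchor)
    p = l2_normalize(positive)

    # Similarity between each raw clip and its own scene-broken positive.
    pos = np.sum(a * p, axis=-1, keepdims=True) / temperature  # (B, 1)

    # Similarity between each raw clip and the raw clips of the other videos.
    neg = a @ a.T / temperature  # (B, B)
    np.fill_diagonal(neg, -np.inf)  # a clip is not its own negative

    # Column 0 holds the positive; apply a numerically stable log-softmax.
    logits = np.concatenate([pos, neg], axis=1)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_prob[:, 0]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    anchor = rng.normal(size=(4, 16))
    positive = anchor + 0.05 * rng.normal(size=(4, 16))
    negative = rng.normal(size=(4, 16))
    print(triplet_loss(anchor, positive, negative))
    print(contrastive_loss(anchor, positive))
```

In both functions the anchor is the raw clip. The triplet loss uses the motion-broken clip of the same video as the negative, while the contrastive loss takes its negatives from other videos in the batch.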

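The positive and negative setup in this paper rests on a strong constraint: motion matters most, so the model is forced to learn motion rather than scene features. Rather than decoupling, it is more accurate to say that the model is made to suppress one side specifically. Methods in some other papers tend to emphasize motion information while still learning spatiotemporal features, rather than "suppressing" anything.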

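In addition, this paper uses multi-modal information, so its comparison with other methods is not fair; it mentions CVRL but does not compare performance against it. Overall performance also falls short of the 2020 SoTA.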
