OpenGVLab / UniFormerV2

[ICCV2023] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
https://arxiv.org/abs/2211.09552
Apache License 2.0

Question about the dimensions of the input X #56

Closed perfectFeng closed 11 months ago

perfectFeng commented 12 months ago

Hello, I noticed that before attention is computed, the input X has shape (HW, BT, C). This seems equivalent to computing attention across different batches and T. Why is it set up this way?

Andy1621 commented 12 months ago

Because we use CLIP pretraining, the main branch uses spatial attention, which operates on each frame independently.
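
To make the shape convention concrete, here is a minimal NumPy sketch, not the repository's code. It assumes the sequence-first layout (L, N, C), which is what PyTorch's `nn.MultiheadAttention` expects by default (`batch_first=False`): axis 0 holds the tokens that attend to each other and axis 1 is treated as a batch. With X of shape (HW, BT, C), the HW spatial positions are the tokens and each of the BT frames is its own sequence. The helper `spatial_self_attention`, the tensor sizes, and the omission of learned projections and multiple heads are all illustrative assumptions.

```python
import numpy as np

def spatial_self_attention(x):
    """Single-head self-attention over axis 0 of a sequence-first tensor.

    x has shape (L, N, C): L tokens per sequence, N independent sequences.
    Learned projections are omitted (q = k = v = x) to keep the sketch short.
    """
    L, N, C = x.shape
    # scores[n, i, j]: similarity of token i and token j inside sequence n
    scores = np.einsum("inc,jnc->nij", x, x) / np.sqrt(C)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens j
    # Weighted sum of values within each sequence; result is (L, N, C) again
    return np.einsum("nij,jnc->inc", weights, x)

B, T, H, W, C = 2, 4, 3, 3, 8  # hypothetical sizes, chosen only for the demo
rng = np.random.default_rng(0)
video = rng.standard_normal((B, T, H, W, C))

# (B, T, H, W, C) -> (HW, BT, C): spatial positions are tokens, frames are the batch
x = video.reshape(B * T, H * W, C).transpose(1, 0, 2)
out = spatial_self_attention(x)

# Perturb only frame 0 and recompute
x_perturbed = x.copy()
x_perturbed[:, 0, :] += 1.0
out_perturbed = spatial_self_attention(x_perturbed)

frame0_changed = not np.allclose(out[:, 0], out_perturbed[:, 0])
others_unchanged = np.allclose(out[:, 1:], out_perturbed[:, 1:])
print(out.shape, frame0_changed, others_unchanged)
```

If the layout assumption holds, perturbing one frame should change only that frame's output and leave every other frame untouched. In other words, attention runs over the HW tokens inside each frame and never mixes information across the batch or T, which matches the "each frame independently" description above.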