Question about the dimensions of the input X

OpenGVLab / UniFormerV2

[ICCV2023] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

https://arxiv.org/abs/2211.09552

Apache License 2.0

294 stars, 19 forks

Closed: perfectFeng closed this issue 11 months ago

perfectFeng commented 12 months ago

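Hello, I noticed that right before attention is computed, the input X has shape (HW, BT, C). That looks like attention is being computed across different batches and across T. Why is it set up this way? [image: 1701327387547]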

Andy1621 commented 12 months ago

Because we use CLIP pretraining, the backbone branch uses spatial attention, which operates on each frame separately.
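To illustrate the answer: under the sequence-first convention (the default layout of PyTorch's `nn.MultiheadAttention`, which expects `(L, N, E)`), a tensor of shape (HW, BT, C) is read as HW tokens per sequence and BT independent sequences. Attention then mixes only the HW spatial tokens inside one frame, and nothing flows between frames or batch items. The sketch below is not code from the UniFormerV2 repository; it is a dependency-free toy with a single head and identity Q/K/V projections, and the names `attend` and `spatial_attention` are made up for this example. It only checks that perturbing one frame leaves the outputs of the other frames untouched.

```python
import math
import random


def attend(tokens):
    """Single-head self-attention over the tokens of ONE frame.

    Assumption for brevity: identity Q/K/V projections (no learned weights).
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token in the frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w / total * v[c] for w, v in zip(weights, tokens)) for c in range(dim)]
        )
    return out


def spatial_attention(x):
    """x uses layout (HW, BT, C): axis 0 = tokens, axis 1 = frames, axis 2 = channels."""
    hw, bt = len(x), len(x[0])
    # Each index along axis 1 (one frame of one video) is attended on its own.
    per_frame = [attend([x[l][b] for l in range(hw)]) for b in range(bt)]
    # Put the result back into (HW, BT, C) layout.
    return [[per_frame[b][l] for b in range(bt)] for l in range(hw)]


random.seed(0)
HW, BT, C = 4, 3, 2
x = [[[random.gauss(0, 1) for _ in range(C)] for _ in range(BT)] for _ in range(HW)]
y = spatial_attention(x)

# Perturb every token of frame 0 only, then recompute.
x2 = [[list(v) for v in row] for row in x]
for l in range(HW):
    x2[l][0] = [v + 1.0 for v in x2[l][0]]
y2 = spatial_attention(x2)

frame0_changed = any(y[l][0] != y2[l][0] for l in range(HW))
other_frames_unchanged = all(
    y[l][b] == y2[l][b] for l in range(HW) for b in range(1, BT)
)
print(frame0_changed, other_frames_unchanged)
```

So the BT axis plays the role of the batch axis: the HW axis is the only one attention runs over, which matches the "each frame separately" behaviour described in the reply.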