OpenGVLab / UniFormerV2

[ICCV2023] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
https://arxiv.org/abs/2211.09552
Apache License 2.0

Visualizing Global Temporal Features? #67

Open 2198310932i1 opened 8 months ago

2198310932i1 commented 8 months ago

Thanks for the very cool work! Is there a way to visualize the global temporal features of the Global MHRA in UniFormerV2, like you did for UniFormerV1?

I get that you reduce computational cost by doing cross-attention between the class token and the spatio-temporal tokens. Since this cross-attention operates on clones of the features and outputs a new class token, how are the global temporal features fused with the remaining features?

What I would like to do is have access to the spatial features (without the temporal features) and access to the combined spatio-temporal features...

Andy1621 commented 8 months ago

You can try visualizing the attention scores of the global cross-attention.
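
As a rough illustration of that suggestion, here is a minimal NumPy sketch of turning class-token cross-attention scores into a per-frame heatmap. It is not taken from the UniFormerV2 code: the function name `cls_cross_attention_map`, the tensor shapes, and the assumption that tokens are flattened in (T, H, W) order are all hypothetical, so check the token layout in the actual model before reshaping. It also assumes you have already extracted the projected class-token query and the projected token keys (for example with a forward hook).

```python
import numpy as np

def cls_cross_attention_map(q_cls, k_tokens, T, H, W):
    """Attention of one class-token query over N = T*H*W tokens.

    q_cls:    (heads, d_head)     projected class-token query (assumed shape)
    k_tokens: (heads, N, d_head)  projected token keys (assumed shape)
    Returns:  (heads, T, H, W)    softmax attention, assuming (T, H, W) token order
    """
    heads, n, d_head = k_tokens.shape
    assert n == T * H * W, "token count must equal T*H*W"
    # Scaled dot-product scores between the single query and every token.
    scores = np.einsum("hd,hnd->hn", q_cls, k_tokens) / np.sqrt(d_head)
    # Numerically stable softmax over the token axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn.reshape(heads, T, H, W)

# Random stand-ins for the real projections, just to exercise the function.
rng = np.random.default_rng(0)
heads, d_head, T, H, W = 12, 64, 8, 14, 14
q = rng.standard_normal((heads, d_head))
k = rng.standard_normal((heads, T * H * W, d_head))

attn = cls_cross_attention_map(q, k, T, H, W)
# Head-averaged attention mass per frame: how much the class token looks at each frame.
temporal_profile = attn.mean(axis=0).sum(axis=(1, 2))
```

Each `attn[h, t]` slice can be upsampled and overlaid on frame `t`, while `temporal_profile` gives a coarse view of which frames the class token attends to most.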