OpenGVLab / VideoMamba

[ECCV2024] VideoMamba: State Space Model for Efficient Video Understanding
https://arxiv.org/abs/2403.06977
Apache License 2.0
846 stars 61 forks

How to achieve the temporal-first scan and spatial-temporal scan v1 & v2? #80

Open JACOBWHY opened 3 months ago

JACOBWHY commented 3 months ago

Hi, thanks for your brilliant work! I saw that Fig. 4 in the paper shows four scan methods, but as far as I can tell only the spatial-first bidirectional scan is used in mamba_simple.py. Could you share some guidance on how to implement the other three: the temporal-first scan and the spatial-temporal scans v1 and v2? Thank you very much for considering my request.

Andy1621 commented 3 months ago

Hi! The key difference between the scans is the order of the tokens in the tensor.

You can use permute and reshape to reorder the tokens before the scan, which gives you the different scan methods.
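
For illustration, here is a minimal sketch of the reordering idea. It is not the repository's code. It assumes the tokens are laid out as (B, T, N, C) with T frames and N patches per frame, and it uses NumPy so that it runs standalone (`transpose` plays the role of PyTorch's `permute`). The function names are hypothetical. Only the two base orderings are shown, because the exact definitions of the spatial-temporal v1 and v2 scans are not given in this thread.

```python
import numpy as np

def spatial_first(x):
    """(B, T, N, C) -> (B, T*N, C): all patches of frame 0, then frame 1, ..."""
    B, T, N, C = x.shape
    return x.reshape(B, T * N, C)

def temporal_first(x):
    """(B, T, N, C) -> (B, N*T, C): all frames of patch 0, then patch 1, ..."""
    B, T, N, C = x.shape
    # Swap the frame and patch axes, then flatten them into one sequence axis.
    return x.transpose(0, 2, 1, 3).reshape(B, N * T, C)

def undo_temporal_first(seq, T, N):
    """Inverse of temporal_first: (B, N*T, C) -> (B, T, N, C)."""
    B, L, C = seq.shape
    return seq.reshape(B, N, T, C).transpose(0, 2, 1, 3)

def reverse(seq):
    """Flip the sequence axis, e.g. for the backward pass of a bidirectional scan."""
    return seq[:, ::-1]

# Toy input: 2 frames, 3 patches per frame, token id = t * N + n.
x = np.arange(6).reshape(1, 2, 3, 1)
print(spatial_first(x)[0, :, 0].tolist())   # [0, 1, 2, 3, 4, 5]
print(temporal_first(x)[0, :, 0].tolist())  # [0, 3, 1, 4, 2, 5]
```

After the scan, the output has to be mapped back to the original layout (here with `undo_temporal_first`) so that later layers see the tokens in the order they expect. In PyTorch, the permuted tensor is non-contiguous, so `reshape` (or `contiguous()` followed by `view`) is needed after `permute`.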