OpenGVLab / VideoMamba

VideoMamba: State Space Model for Efficient Video Understanding
https://arxiv.org/abs/2403.06977
Apache License 2.0

About mamba block #50

Closed: lebron-2016 closed this issue 1 month ago

lebron-2016 commented 1 month ago

Dear author,

I have two questions about the Mamba block and hope you can answer them. First, is drop_path_rate applied to the 14*14 patches? That is, are individual patches skipped at a certain rate? Second, how can I set or modify the scanning method, and where is the code that implements it?

By the way, does the blue box in the picture below represent a single frame? If so, I don't quite understand the difference between spatial-first and temporal-first.

image

Looking forward to your reply!

Thanks!!

Andy1621 commented 1 month ago
  1. drop_path_rate is applied to a block, not a patch.
  2. To modify the scanning method, you have to change the tensor shape here.
  3. As shown in the axes, a blue box represents a video, and each row is a frame.
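
To illustrate point 1, here is a minimal sketch of drop path (stochastic depth) in plain Python. It is not the repository's code: the function name `drop_path`, the nested-list layout (samples, then tokens), and the `rng` parameter are assumptions made for illustration. The point it shows is that one keep/drop decision is made per sample for the whole output of a block's residual branch, so every patch token of that sample is kept or dropped together.

```python
import random


def drop_path(branch_out, drop_prob, training=True, rng=random):
    """Toy drop path: branch_out is a list of samples, each a list of token values.

    One random decision is drawn per sample, so all tokens (patches) of a
    sample are either kept and rescaled, or zeroed together.
    """
    if not training or drop_prob == 0.0:
        return branch_out
    keep_prob = 1.0 - drop_prob
    out = []
    for sample in branch_out:
        keep = rng.random() < keep_prob            # one decision per sample
        scale = 1.0 / keep_prob if keep else 0.0   # rescale kept samples
        out.append([tok * scale for tok in sample])
    return out


if __name__ == "__main__":
    # Four samples with three "patch tokens" each (scalar stand-ins for features).
    batch = [[1.0, 2.0, 3.0] for _ in range(4)]
    for sample in drop_path(batch, drop_prob=0.5):
        print(sample)  # each row is either all zeros or the whole row scaled
```

In a residual block the result would be added back to the block input, so a dropped sample simply skips that block for the current training step.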

lebron-2016 commented 1 month ago

Got it, thank you for your reply!

Regarding the second point, I would also like to ask whether the scanning order is determined by the order of the last dimension of xz. For example, the current scan is spatial-first, with the 1569 (14x14x8+1) patches arranged frame by frame. To implement the temporal-first method, do I only need to change the order of these 1569 patches?

image

Thanks!!

Andy1621 commented 1 month ago

Yes. For temporal first, you need to reshape and permute the tensor.
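
As a rough sketch of that reordering (not taken from the repository), the snippet below builds the index permutation that turns a spatial-first token sequence into a temporal-first one. The helper name `temporal_first_order` is hypothetical, and it assumes the extra "+1" token is a class token stored at index 0; if it sits elsewhere, the offset has to change. On a tensor of shape (B, T*N, C) without the class token, the same reordering is a reshape to (B, T, N, C), a permute to (B, N, T, C), and a reshape back to (B, N*T, C).

```python
def temporal_first_order(num_frames, tokens_per_frame, has_cls=True):
    """Return indices that reorder a spatial-first sequence to temporal-first.

    Spatial-first layout: [cls], frame 0 tokens, frame 1 tokens, ...
    Temporal-first layout: [cls], then for each spatial position, its token
    from frame 0, frame 1, ..., frame T-1.
    """
    offset = 1 if has_cls else 0       # assumed: class token at index 0
    order = [0] if has_cls else []
    for n in range(tokens_per_frame):  # spatial position within a frame
        for t in range(num_frames):    # walk through time at that position
            order.append(offset + t * tokens_per_frame + n)
    return order


if __name__ == "__main__":
    order = temporal_first_order(num_frames=8, tokens_per_frame=14 * 14)
    print(len(order))   # 14 * 14 * 8 + 1 = 1569 tokens
    print(order[:4])    # class token, then position 0 in frames 0, 1, 2
```

Indexing the sequence dimension with `order` gives the temporal-first sequence. If later layers expect the original layout, the inverse permutation would have to be applied after the scan.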

lebron-2016 commented 1 month ago

OK. Thanks a lot!!