NVlabs / MambaVision

Official PyTorch Implementation of MambaVision: A Hybrid Mamba-Transformer Vision Backbone
https://arxiv.org/abs/2407.08083

How to Input Video Sequences for Action Recognition #20

Closed: 95AliceHong closed this issue 3 months ago

95AliceHong commented 3 months ago

Hello,

I hope this message finds you well. I am writing to inquire about the applicability of the backbone network you have proposed for video processing. Specifically, I am working with a dataset that comprises multiple folders, each containing a varying number of image frames extracted from videos. My intention is to treat each folder as an input sequence, implying that the input dimension would ideally be (B, T, C, H, W), where T represents the number of frames.

Given that the network's standard input dimension is typically set as (B, C, H, W), I was wondering if it would be feasible to reinterpret B*T as the new batch size (B), effectively making B dynamic rather than fixed. This approach would allow me to accommodate sequences of varying lengths directly without the need for padding or truncation. I would greatly appreciate your insights on whether this is a viable solution or if there are alternative strategies you would recommend for handling such video frame sequences as inputs.

Thank you in advance for your time and expertise. I look forward to your guidance.

Best regards,

ahatamiz commented 3 months ago

Hi @95AliceHong

Thanks for this question. Although our work is not tailored for video processing, I find your solution to be technically viable. In essence, each frame in the aforementioned folders is an image and can be folded into the batch dimension as an independent input.
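A minimal sketch of this idea in PyTorch, under stated assumptions: `TinyBackbone` is a hypothetical stand-in for any image backbone that maps (N, C, H, W) to (N, D), not the MambaVision model itself, and mean pooling over frames is only one possible way to get a clip-level feature. Clips of different lengths are concatenated along the batch dimension and split back afterwards, so no padding or truncation is needed. All frames are assumed to share the same H and W.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for an image backbone: (N, C, H, W) -> (N, dim)."""

    def __init__(self, in_chans=3, dim=16):
        super().__init__()
        self.conv = nn.Conv2d(in_chans, dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        return self.pool(self.conv(x)).flatten(1)


def encode_clips(backbone, clips):
    """Encode clips given as a list of (T_i, C, H, W) tensors; T_i may differ."""
    lengths = [clip.shape[0] for clip in clips]
    frames = torch.cat(clips, dim=0)               # (sum of T_i, C, H, W)
    feats = backbone(frames)                       # (sum of T_i, D)
    per_clip = torch.split(feats, lengths, dim=0)  # one (T_i, D) tensor per clip
    # Mean over frames is an assumption; the frames are encoded independently,
    # so any temporal modelling has to happen after this step.
    return torch.stack([f.mean(dim=0) for f in per_clip])  # (num_clips, D)


backbone = TinyBackbone().eval()
clips = [torch.randn(5, 3, 64, 64), torch.randn(8, 3, 64, 64)]
with torch.no_grad():
    clip_feats = encode_clips(backbone, clips)
print(clip_feats.shape)  # torch.Size([2, 16])
```

Because the backbone sees each frame independently, the order of frames within a clip does not affect the per-frame features; a temporal head (pooling, a recurrent layer, or attention over the T_i features) would be needed on top for action recognition.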

This approach also gives you some flexibility in how the frames of each video are processed. For example, you may want to skip a certain number of frames in a systematic fashion.
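As an illustration of systematic frame skipping, here is a small sketch that keeps every `stride`-th frame of a clip. The helper name `sample_frames`, its parameters, and the frame file names are hypothetical and chosen only for this example.

```python
def sample_frames(frame_paths, stride=2, max_frames=None):
    """Keep every `stride`-th frame, optionally capped at `max_frames` frames."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    sampled = frame_paths[::stride]
    if max_frames is not None:
        sampled = sampled[:max_frames]
    return sampled


# Hypothetical frame file names for one video folder.
frames = [f"frame_{i:04d}.jpg" for i in range(10)]
print(sample_frames(frames, stride=3))
# ['frame_0000.jpg', 'frame_0003.jpg', 'frame_0006.jpg', 'frame_0009.jpg']
```

Sampling before loading the images keeps the effective batch size (the total number of frames across clips) under control, which matters when long videos are folded into the batch dimension.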

Needless to say, MambaVision also supports processing images of arbitrary resolution, which should be helpful in your pipeline.

I look forward to seeing what you can make with MambaVision for action recognition. We can also add a link to your repository if you build something based on our backbone.

95AliceHong commented 3 months ago

Thank you very much for your prompt and helpful response! I will definitely give it a try. Your guidance is much appreciated.