OpenGVLab / VideoMamba

VideoMamba: State Space Model for Efficient Video Understanding
https://arxiv.org/abs/2403.06977
Apache License 2.0

UMT_QA Usage #31

Closed NyleSiddiqui closed 2 months ago

NyleSiddiqui commented 2 months ago

Hi,

I have been working on taking VideoMamba's multi-modal pre-trained weights and applying them to other VQA downstream tasks/datasets. So far, I have been using the UMT_VideoMamba model, as this is the model compatible with the provided pre-trained weights, but I stumbled upon the UMT_QA model, which appears to be specifically tailored for VQA (i.e., it processes questions and candidate answers simultaneously and ranks the candidates), along with what appear to be outdated QA config files in the 'configs' folder. Before I spend time looking into this UMT_QA model, I just wanted to confirm whether it was ever used, or if it is deprecated. I was not able to find any references to the model in the repo besides its initialization, so I am assuming it was just a research idea that ended up not being used. Thanks in advance for your help!
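For context, here is a minimal, self-contained sketch of the ranking-style setup described above: one fused multimodal embedding per candidate answer is scored, and training uses cross-entropy over the candidates so the correct answer is ranked highest. The names (`AnswerRankingHead`, the tensor shapes) are illustrative assumptions, not taken from the repo.

```python
import torch
import torch.nn as nn

class AnswerRankingHead(nn.Module):
    """Scores each fused (video, question, candidate-answer) embedding;
    the model is trained to rank the correct answer highest."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: [batch, num_candidates, hidden_dim] -- one fused
        # multimodal embedding per candidate answer.
        return self.scorer(fused).squeeze(-1)  # [batch, num_candidates]

# Hypothetical usage: 2 clips, 5 candidate answers each.
head = AnswerRankingHead(hidden_dim=768)
fused = torch.randn(2, 5, 768)
logits = head(fused)                    # [2, 5] candidate scores
labels = torch.tensor([3, 0])           # index of the correct candidate
loss = nn.functional.cross_entropy(logits, labels)
```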

Andy1621 commented 2 months ago

Hi! UMT is my previous paper, and this repo is built on top of it. I fine-tuned UMT for QA via UMT_QA. For VideoMamba, you may need to adapt the UMT_QA code into a corresponding UMT_VideoMamba_QA model.