boheumd / MA-LMM

(CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
https://boheumd.github.io/MA-LMM/
MIT License

Weights? #5

Closed: mvsoom closed this 2 months ago

mvsoom commented 2 months ago

Any chance of releasing the weights? I currently lack the compute to train this myself. Thanks!

boheumd commented 2 months ago

Hi, we have released the finetuned checkpoints for each video dataset, including ActivityNet, Breakfast, COIN, LVU, MSRVTT, MSVD, and YouCook2. Our model can also leverage the pre-trained InstructBLIP weights without any finetuning to conduct zero-shot evaluation on video datasets. Please refer to the README.md for more details.
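
For reference, loading the model for zero-shot evaluation roughly follows the standard LAVIS pattern sketched below. The `name` and `model_type` strings are placeholders, not necessarily what this repo registers; the README and configs list the actual identifiers.

```python
import torch
from lavis.models import load_model_and_preprocess

# Minimal zero-shot loading sketch via the standard LAVIS interface.
# NOTE: `name` and `model_type` are assumed placeholders -- check the
# README/configs for the identifiers MA-LMM actually registers.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct",   # placeholder model name
    model_type="vicuna7b",          # pulls the InstructBLIP Vicuna-7B weights
    is_eval=True,
    device=device,
)
```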

mvsoom commented 2 months ago

Thanks for the reply, I see. Amazing that this just works in plug-and-play fashion.

Are the InstructBLIP weights downloaded automatically, so that only the Vicuna weights need to be downloaded manually?

Maybe a quick example on a test video (like a captioning task) would be helpful. If I can get it working in the coming days (I'm new to LAVIS), I'll submit a PR.

boheumd commented 2 months ago

Hello, the demo.ipynb is now released. You can test it with your example video.
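
In case it helps others reading along, a captioning test on a short clip roughly looks like the sketch below (reusing `model`, `vis_processors`, and `device` from the loading sketch above). The notebook is the authoritative reference; the frame sampling, tensor layout, and prompt here are only assumptions.

```python
import torch
import decord
from PIL import Image

# Rough captioning sketch; frame count, tensor layout, and the prompt are
# assumptions -- demo.ipynb shows the exact pipeline used by the authors.
vr = decord.VideoReader("example_video.mp4")  # hypothetical test clip
indices = torch.linspace(0, len(vr) - 1, steps=20).long().tolist()
frames = [vis_processors["eval"](Image.fromarray(vr[i].asnumpy())) for i in indices]
video = torch.stack(frames, dim=0).unsqueeze(0).to(device)  # (1, T, C, H, W); layout is a guess

caption = model.generate({"image": video, "prompt": "Describe the video in detail."})
print(caption)
```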

mvsoom commented 2 months ago

Much thanks!

nisargshah1999 commented 6 days ago

Hi @boheumd, very interesting work! For off-the-shelf testing (zero-shot evaluation), do we need to load any weights other than the pre-trained checkpoint from https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth (which is loaded automatically)?

If the above is correct, can you confirm the expected result for the LVU relationship task in the off-the-shelf setting? I currently get

top1: 0.00, top5: 0.00

so I might be doing something wrong.

boheumd commented 1 day ago


Hi. For the LVU off-the-shelf setting, you need to add all the candidate labels to the prompt when doing inference. For example, for the genre classification task, the prompt should look like this: "Please select the genre type of the input video from the list [action, comedy, romance, thriller]". Without the candidate answers in the prompt, the task is much harder and the output accuracy can be 0.
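
To make that concrete, here is a small sketch of how such a prompt could be assembled for the off-the-shelf setting. The label list and the `generate()` call are illustrative; the repo's eval configs define the actual template used for LVU.

```python
# Illustrative off-the-shelf classification prompt for LVU genre prediction.
# The label list and the generate() call follow the LAVIS convention; the
# exact prompt template in the repo's eval configs may differ.
genre_labels = ["action", "comedy", "romance", "thriller"]
prompt = (
    "Please select the genre type of the input video from the list "
    f"[{', '.join(genre_labels)}]."
)

# `video` is a preprocessed frame tensor, e.g. prepared as in demo.ipynb.
answer = model.generate({"image": video, "prompt": prompt})
print(answer)
```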