boheumd / MA-LMM

(CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
https://boheumd.github.io/MA-LMM/
MIT License

Question about the query memory bank #22

Closed LiJiaqi96 closed 2 weeks ago

LiJiaqi96 commented 3 weeks ago

[Screenshot of Figure 2 from the paper]

Hi, thanks for your excellent work on long-term video understanding! I have some questions about the query memory bank:

  1. How many Q-Former blocks does the model contain?
  2. Is a separate Q-Former used for each frame, or the same one? How does the query memory bank change as frames are processed?
  3. Does each Q-Former block have its own query memory bank, or is one memory bank shared across blocks?

Many thanks!

boheumd commented 3 weeks ago

Hi, thank you for your interest.

  1. We follow the standard Q-Former architecture used in BLIP-2/InstructBLIP, which consists of 12 Q-Former blocks.
  2. Our model uses the same Q-Former for every frame. During the online processing of video frames, the memory bank size grows until it reaches the maximum limit (see the sketch below).
  3. Each Q-Former block has its own query memory bank, so there are 12 query memory banks in total.
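
To make points 2 and 3 concrete, here is a minimal sketch of how a per-block query memory bank could behave (hypothetical class and parameter names, not the code in this repo): each block appends the current frame's queries, and once the bank exceeds its maximum length, the most similar adjacent pair is averaged, following the memory bank compression idea described in the paper.

```python
import torch
import torch.nn.functional as F

class QueryMemoryBank:
    """Sketch of a per-block query memory bank. One instance per
    Q-Former block, so a 12-block Q-Former keeps 12 of these.
    Hypothetical names, not the repository's implementation."""

    def __init__(self, max_size: int):
        self.max_size = max_size  # maximum number of timesteps retained
        self.bank = []            # list of [num_queries, dim] tensors, one per frame

    def update(self, queries: torch.Tensor):
        # Append the current frame's input queries; compress once the
        # bank exceeds its maximum length.
        self.bank.append(queries)
        if len(self.bank) > self.max_size:
            self._compress()

    def _compress(self):
        # Average the adjacent pair of entries with the highest cosine
        # similarity, shrinking the bank back to max_size while keeping
        # the most distinct content.
        sims = torch.stack([
            F.cosine_similarity(self.bank[i].flatten(),
                                self.bank[i + 1].flatten(), dim=0)
            for i in range(len(self.bank) - 1)
        ])
        i = int(sims.argmax())
        merged = (self.bank[i] + self.bank[i + 1]) / 2
        self.bank = self.bank[:i] + [merged] + self.bank[i + 2:]

    def get(self) -> torch.Tensor:
        # Concatenate along the time axis; cross-attention in the
        # corresponding Q-Former block attends over this history.
        return torch.cat(self.bank, dim=0)
```

Under these assumptions, a 12-block Q-Former would keep one bank per block, e.g. `banks = [QueryMemoryBank(max_size=10) for _ in range(12)]`, updating each bank with that block's input queries as frames arrive.
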
LiJiaqi96 commented 3 weeks ago

Thanks for your reply! I had misunderstood the structure of your Q-Former. It is now clear that the illustration shows the different blocks of the Q-Former.