Question about the query memory bank

LiJiaqi96 commented 3 weeks ago

Hi, thanks for your excellent work on long video understanding! I have some questions about the query memory bank:

1. What is the intuition behind using the cascaded Q-Former with l layers (rather than T)?
2. It seems that as time goes on, the size of the memory bank also increases until it reaches the maximum limit. So does the model use a different Q-Former for each time step t?
3. I could not understand the evolution of z_t. I suppose the query memory bank stores [z_1, z_2, ..., z_t], but Figure 2 in the paper shows several query memory banks. Could you please help me clarify this point?

Many thanks!

boheumd commented 3 weeks ago

Hi, thank you for your interest.

1. We follow the standard Q-Former architecture used in BLIP2/InstructBLIP, which consists of 12 Q-Former blocks.
2. Our model uses the same Q-Former at every time step. During the online processing of video frames, only the memory bank grows, until it reaches the maximum limit.
3. Each Q-Former block has its own query memory bank, so there are 12 query memory banks.
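
The reply above can be illustrated with a minimal toy sketch. It is not the MA-LMM code: `ToyBlock`, `ToyQFormer`, and `MAX_BANK_LEN` are hypothetical names, the "query" is a plain integer rather than a tensor, and dropping the oldest entry once the limit is reached is an assumption made only to keep the example short (the thread does not say how the limit is enforced). It shows only that a single Q-Former is reused at every time step while each of its 12 blocks keeps its own bank.

```python
from collections import deque

NUM_BLOCKS = 12    # from the reply: 12 Q-Former blocks
MAX_BANK_LEN = 4   # hypothetical limit, chosen for illustration


class ToyBlock:
    """Stand-in for one Q-Former block with its own query memory bank."""

    def __init__(self, max_len):
        # Assumption: once full, the oldest entry is dropped (deque maxlen).
        self.bank = deque(maxlen=max_len)

    def forward(self, query, t):
        # Store this block's input query for time step t in its own bank.
        self.bank.append((t, query))
        # Placeholder for attention over self.bank; here the query is
        # only incremented so each block sees a different input.
        return query + 1


class ToyQFormer:
    """One Q-Former, shared across all time steps."""

    def __init__(self, num_blocks=NUM_BLOCKS, max_len=MAX_BANK_LEN):
        self.blocks = [ToyBlock(max_len) for _ in range(num_blocks)]

    def step(self, t, query=0):
        # The same blocks process every frame; only the banks change.
        for block in self.blocks:
            query = block.forward(query, t)
        return query


qformer = ToyQFormer()
bank_sizes = []
for t in range(1, 7):  # six frames, processed online
    qformer.step(t)
    bank_sizes.append(len(qformer.blocks[0].bank))

print(bank_sizes)
```

In this sketch the bank of each block grows with t and then stays at `MAX_BANK_LEN`, while the number of blocks (and therefore banks) stays fixed at 12.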

LiJiaqi96 commented 3 weeks ago

Thanks for your reply! I misunderstood the structure of your Q-Former. It is now clear that the illustration shows the different blocks of the Q-Former.

boheumd / MA-LMM