Hi, thanks for your excellent work on long video understanding! I have some questions about the query memory bank:
What is the intuition behind using the cascaded Q-Former with l layers (rather than T)?
It seems that the memory bank grows over time until it reaches the maximum limit. So does the model use a different Q-Former for each time step t?
I do not understand how z_t evolves. I assumed the query memory bank stores [z_1, z_2, ..., z_t], but Figure 2 in the paper shows several query memory banks. Could you please clarify this point?
We follow the standard Q-Former architecture used in BLIP2/InstructBLIP, which consists of 12 Q-Former blocks.
Our model uses the same Q-Former at every time step. During the online processing of video frames, the memory bank size increases until it reaches the maximum limit.
Each Q-Former block has its own query memory bank, so there are 12 query memory banks in total.
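To make the three answers above concrete, here is a minimal toy sketch in plain Python. It is not the actual implementation: `ToyQFormerBlock`, `process_video`, and `MAX_BANK_SIZE` are hypothetical names, the queries are plain integers instead of tensors, and the thread does not say what happens once the limit is reached, so the sketch simply drops the oldest entry as a placeholder.

```python
from collections import deque

NUM_BLOCKS = 12      # number of Q-Former blocks, as stated in the reply
MAX_BANK_SIZE = 4    # hypothetical cap, chosen only for illustration


class ToyQFormerBlock:
    """Stand-in for one Q-Former block that owns its own query memory bank."""

    def __init__(self, max_bank_size):
        # deque(maxlen=...) discards the oldest entry once full. This FIFO
        # policy is a placeholder assumption, not the method from the paper.
        self.query_bank = deque(maxlen=max_bank_size)

    def forward(self, query):
        # Store this block's input query for the current time step.
        self.query_bank.append(query)
        # A real block would presumably attend over the bank; the toy
        # version only modifies the query so each block sees a new value.
        return query + 1


def process_video(num_frames):
    # One set of blocks (a single shared Q-Former) is reused for every frame.
    blocks = [ToyQFormerBlock(MAX_BANK_SIZE) for _ in range(NUM_BLOCKS)]
    sizes = []
    for t in range(num_frames):
        z = t * 100  # toy placeholder for the initial query at step t
        for block in blocks:
            z = block.forward(z)
        sizes.append(len(blocks[0].query_bank))
    return blocks, sizes


blocks, sizes = process_video(num_frames=6)
print(len(blocks))  # 12
print(sizes)        # [1, 2, 3, 4, 4, 4]
```

The same 12 blocks are reused at every time step, each block keeps a separate bank, and every bank grows by one entry per frame until it hits the cap.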
Many thanks!