Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Questions about processing multiple images in the form of context and video frames🤔? #295

Closed · xmc-andy closed this 11 months ago

xmc-andy commented 11 months ago

Hello authors, I've been having a question about context and video frame format 🤔 that when I'm dealing with multiple images I have two options. That is, the video frames F!=1 and T=1 (for example, SD format), and the context form F=1, T!=1, I understand that the difference between them should be in the second Adapter-GATED XATTN-DENSE layers. If the form of a video frame is used, a prompt in the second adapter performs cross-attention with all video frames, while in the form of a context a prompt performs cross-attention with each picture alone. They then interact during self-attention in the language encoder. But I saw that the context form in the format in mimicit_dataset.py only splices the text of the context, rather than fusing the previous image and text to fuse with the current image and text. Do I understand it correctly? If not, please point it out. I would like to know if there is a contextual method other than text splicing and T<->F conversion?