Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Questions about processing multiple images in the form of context and video frames🤔? #295

Closed · xmc-andy closed this 11 months ago

xmc-andy commented 11 months ago

Hello authors, I've been having a question about context and video frame format 🤔 that when I'm dealing with multiple images I have two options. That is, the video frames F!=1 and T=1 (for example, SD format), and the context form F=1, T!=1, I understand that the difference between them should be in the second Adapter-GATED XATTN-DENSE layers. If the form of a video frame is used, a prompt in the second adapter performs cross-attention with all video frames, while in the form of a context a prompt performs cross-attention with each picture alone. They then interact during self-attention in the language encoder. But I saw that the context form in the format in mimicit_dataset.py only splices the text of the context, rather than fusing the previous image and text to fuse with the current image and text. Do I understand it correctly? If not, please point it out. I would like to know if there is a contextual method other than text splicing and T<->F conversion?