Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

About choosing dataset format and pre-training weights #306

Open xmc-andy opened 10 months ago

xmc-andy commented 10 months ago

Hello, authors! I have a question about choosing a dataset format and the corresponding weights. I am doing a classification task whose input is multiple images plus a prompt. If the multiple images are treated as a video, there seem to be two options: SD format (a single `<image>` plus a single prompt, where the `<image>` represents all of the images) and DC mode (a single prompt plus multiple `<image>` tokens). I understand the difference lies in how the prompt is used: DC mode is better suited when each picture has its own detailed prompt, while SD mode suits a single unified prompt for all pictures. Is my understanding correct?

In addition, I previously used the Image-MPT7B weights in SD mode, but it seems the Video-LLaMA7B-DenseCaption weights (in DC/SD mode) are better suited to the video-frame setting. Is my understanding correct?

Luodian commented 10 months ago

Yes, that's pretty much correct! I suggest you use DC mode together with the Video pretrained weights. As you can see from our web demo, the backend model is Video-LLaMA7B-DC.
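For reference, a minimal loading sketch along the lines of the repo's demo scripts; the `otter_ai` import path and the HuggingFace checkpoint id are assumptions taken from the demos, so please verify them against the current README before relying on this:

```python
import torch
from otter_ai import OtterForConditionalGeneration  # import path as used in the demo scripts (assumption)

# Checkpoint id assumed from the hosted web demo; confirm it on the HuggingFace Hub.
model = OtterForConditionalGeneration.from_pretrained(
    "luodian/OTTER-Video-LLaMA7B-DenseCaption",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()
```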

Remember to put the multiple images as frames along the F dimension of the [B, T, F, C, H, W] input (set a breakpoint on vision_x to check the actual shape during training). I'd also suggest you try both templates:

1. <image> + prompt
2. <image><image>...<image> + prompt

For training DC, we use the first.
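To make the shapes concrete, here is a minimal sketch of packing several images as frames of a single video input and building the two templates above. The preprocessing pipeline, file names, and prompt wording are placeholders, not the exact pipeline from the repo:

```python
import torch
from PIL import Image
from torchvision import transforms

# Placeholder preprocessing; use the image processor from your actual setup.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

image_paths = ["frame_0.jpg", "frame_1.jpg", "frame_2.jpg"]  # hypothetical files
frames = [preprocess(Image.open(p).convert("RGB")) for p in image_paths]  # each [C, H, W]

# Stack the images along the frame (F) dimension, then add the batch (B)
# and media (T) dimensions to get [B, T, F, C, H, W].
vision_x = torch.stack(frames, dim=0)          # [F, C, H, W]
vision_x = vision_x.unsqueeze(0).unsqueeze(0)  # [1, 1, F, C, H, W]
print(vision_x.shape)                          # torch.Size([1, 1, 3, 3, 224, 224])

instruction = "Which class do these images belong to?"  # placeholder prompt

# Template 1: a single <image> token standing for all frames (used for DC training above).
text_1 = "<image>" + instruction

# Template 2: one <image> token per image.
text_2 = "<image>" * len(frames) + instruction
```

One caveat: in OpenFlamingo-style inputs, multiple `<image>` tokens in the text are usually matched to separate slots along the T dimension rather than the F dimension, so with the second template you may need to arrange the tensor as [1, N, 1, C, H, W] instead; check how vision_x is consumed in your training code.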

xmc-andy commented 10 months ago

Thank you so much!