Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

[model] Understanding video with images as in-context examples #276

Open kassy11 opened 11 months ago

kassy11 commented 11 months ago

I want to give the model some images as in-context examples, then input a video and ask questions about the video content. (Specifically, I would like to teach the model the type of dog through images and then have the model count the number of dogs in the video.)

The Otter-image model can be given images as context, but it cannot take a video as input. Conversely, the Otter-video model can take a video as input, but it cannot be given images as context.

Is there a recommended implementation approach or model for this kind of situation?
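
To make the idea concrete, here is a minimal sketch of one possible workaround: treat the video as a handful of evenly sampled frames and place them after the in-context example images in a single interleaved prompt for an image model. This is untested against Otter. The function names, the example question text, and the "User: ... GPT:<answer>" template are assumptions for illustration; the `<image>` and `<|endofchunk|>` markers follow the OpenFlamingo-style interleaved format. Whether the model actually attends to every frame when several `<image>` markers precede one question is also an assumption that would need checking.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples frame indices spread evenly over the video."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    # Take the middle of each equally sized segment of the video.
    return [int(step * i + step / 2) for i in range(num_samples)]


def build_prompt(example_labels: List[str], num_query_frames: int, question: str) -> str:
    """Build an interleaved prompt: labelled example images, then the video frames.

    Hypothetical template (an assumption, not confirmed by the Otter docs):
    one <image> marker per visual input, in the same order as the image tensors.
    """
    parts = []
    for label in example_labels:
        # Each in-context example: one image, a fixed question, and its answer.
        parts.append(
            f"<image>User: What is in this image? GPT:<answer> {label}<|endofchunk|>"
        )
    # The query: all sampled video frames, then the actual question, left open
    # for the model to complete.
    parts.append("<image>" * num_query_frames + f"User: {question} GPT:<answer>")
    return "".join(parts)


if __name__ == "__main__":
    indices = sample_frame_indices(num_frames=100, num_samples=4)
    prompt = build_prompt(
        example_labels=["a corgi", "a husky"],
        num_query_frames=len(indices),
        question="How many dogs appear in these frames?",
    )
    print(indices)
    print(prompt)
```

The images passed to the model would need to be stacked in the same order as the `<image>` markers: the example images first, then the sampled frames. Counting across frames may still double-count the same dog, so this only addresses how to combine the two kinds of input, not the counting accuracy.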

hcwei13 commented 10 months ago

I have the same needs!!! Have you solved it?