[model] Understanding video with images as in-context examples

I want to give the model some images as in-context examples, then input a video and ask questions about the video content. Specifically, I would like to teach the model a type of dog using images and then have the model count the number of dogs in the video.

The Otter-image model accepts images as context but cannot take a video as input. Conversely, the Otter-video model accepts a video as input but cannot take images as context.
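To make the gap concrete, one possible workaround (an assumption on my part, not something Otter documents) would be to sample a few frames from the video and pass them to an image-capable model after the reference images. The sketch below only shows the frame sampling and the ordering of the prompt. `PromptItem`, `sample_frame_indices`, and `build_prompt` are hypothetical names, and no real Otter API is called.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PromptItem:
    kind: str  # "context_image" or "video_frame"
    ref: str   # image path or frame identifier (hypothetical naming)
    text: str  # text paired with this image, may be empty


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick evenly spaced frame indices, one from the middle of each segment."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def build_prompt(
    context: List[Tuple[str, str]],  # (image path, label) pairs
    frame_indices: List[int],
    question: str,
) -> List[PromptItem]:
    """Order the inputs: labelled reference images first, then video frames.

    The question is attached to the last frame so that it follows all images.
    """
    if not frame_indices:
        raise ValueError("at least one video frame is required")
    items = [
        PromptItem("context_image", path, f"This is a {label}.")
        for path, label in context
    ]
    for idx in frame_indices:
        items.append(PromptItem("video_frame", f"frame_{idx}", ""))
    items[-1].text = question
    return items


if __name__ == "__main__":
    frames = sample_frame_indices(total_frames=100, num_samples=4)
    for item in build_prompt([("corgi.jpg", "corgi")], frames, "How many corgis appear?"):
        print(item)
```

This treats the video as a handful of still images, so temporal information is lost and the same dog may be counted in more than one frame.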

Is there an optimal implementation method or model for this type of situation?

Luodian / Otter, issue #276