I notice the paper (VideoChat2) points out that GPT-4V uses 16 frames (which I assume are sampled uniformly). However, how are these frames input into the model, given that it accepts only single images at a time? Is there a sample prompt for this?
For fairness, we feed every video model the same 16 uniformly sampled frames. You can refer to the official cookbook for examples: OpenAI Cookbook.
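The Chat Completions API does accept multiple images in a single user message, so the 16 frames can be passed together as base64-encoded image parts followed by the question text. Below is a minimal sketch of that pattern; the center-of-segment sampling rule and the helper names (`uniform_frame_indices`, `build_gpt4v_messages`) are my own illustration, not code from the paper or the cookbook.

```python
import base64


def uniform_frame_indices(num_frames: int, num_samples: int = 16):
    # Split the video into num_samples equal segments and take the
    # middle frame of each segment (one common "uniform sampling" rule).
    seg = num_frames / num_samples
    return [min(int((i + 0.5) * seg), num_frames - 1) for i in range(num_samples)]


def build_gpt4v_messages(frames_b64, question: str):
    # Pack all frames into ONE user message as image parts, then append
    # the question as a text part, matching the multi-image chat format.
    content = [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
        for b64 in frames_b64
    ]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


def encode_frame(jpeg_bytes: bytes) -> str:
    # Frames must be base64-encoded before being embedded in the data URL.
    return base64.b64encode(jpeg_bytes).decode("utf-8")
```

The resulting `messages` list would then be passed to `client.chat.completions.create(...)` with a vision-capable model; frame extraction itself (e.g. with OpenCV or decord) is omitted here.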