OpenGVLab / Ask-Anything

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
https://vchat.opengvlab.com/

VideoChat2 Mistral Vs Vicuna Configs #177

Closed mcrchopra closed 1 month ago

mcrchopra commented 1 month ago

Hi VideoChat2 Team,

I'm looking at the Stage 2 configs for both the Vicuna and Mistral variants, and I have a few quick questions:

  1. Why do the num_frames values differ between the two configs? Mistral uses num_frames=4 while Vicuna uses num_frames=8 (see the sketch after this list).
  2. At inference time in demo_mistral.ipynb, num_frames is set to 16, even though only 4 frames were used at train time. Why does this discrepancy between train and inference exist? Why not set num_frames to 16 during training as well?
  3. What's the difference between "webvid10m_cc14m_plus" (the dataset corpus used in the Vicuna config) and "webvid10m_cc3m" (the dataset corpus used in the Mistral config)? Are these corpora made up only of data from WebVid and CC14M/CC3M? In the paper, you also mention utilizing 10M captions from InternVid and 2M captions from COCO and other assorted image datasets.
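
For reference, a condensed, illustrative summary of the settings in question (these dicts are paraphrases, not the actual config files; only the num_frames key is taken from the repo):

```python
# Illustrative summary of the num_frames values being asked about.
# These are NOT the real config files, just the fields under discussion.
vicuna_stage2 = dict(num_frames=8)    # Vicuna Stage 2 config
mistral_stage2 = dict(num_frames=4)   # Mistral Stage 2 config
mistral_demo = dict(num_frames=16)    # demo_mistral.ipynb, inference time
```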
Andy1621 commented 1 month ago

Good questions!

  1. The num_frames values in the configs are actually placeholders, and I changed them to 4 in the ipynb. Since UMT uses sin-cos position embeddings and is pretrained with 4 frames, I set it to 4 for better position interpolation (see the first sketch below).

  2. In instruction tuning we use 8 frames, but we find 16 frames work better across different tasks at inference. We did not train with 16 frames as a trade-off between efficiency and effectiveness.

  3. You can find the data details in data.py. We use only WebVid10M and CC3M in Stage 2 of the Mistral variant, since we find that more data is actually harmful, leading to roughly a 1~2% accuracy drop on MVBench (see the second sketch below).
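
To make the frame-count answers concrete, here is a minimal sketch of interpolating a temporal sin-cos position embedding from the 4 pretraining frames to a larger frame count. The function name, tensor layout, and embedding width are assumptions for illustration, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def interpolate_temporal_pos_embed(pos_embed: torch.Tensor, new_num_frames: int) -> torch.Tensor:
    """Linearly resample a temporal position embedding along the time axis.

    pos_embed: (old_num_frames, embed_dim), e.g. a stand-in for the 4-frame
    sin-cos table from UMT pretraining. The layout is an assumption here.
    """
    # (T, C) -> (1, C, T): F.interpolate's "linear" mode expects (N, C, W)
    pe = pos_embed.t().unsqueeze(0)
    pe = F.interpolate(pe, size=new_num_frames, mode="linear", align_corners=False)
    # (1, C, T_new) -> (T_new, C)
    return pe.squeeze(0).t()

# E.g. stretch a 4-frame table to the 16 frames used in demo_mistral.ipynb.
pe4 = torch.randn(4, 768)   # placeholder for the real sin-cos table
pe16 = interpolate_temporal_pos_embed(pe4, 16)
print(pe16.shape)           # torch.Size([16, 768])
```

The same resampling is what lets an 8-frame instruction-tuned model run with 16 frames at inference, which is the train/inference gap asked about in question 2.

For the corpus question, a hypothetical sketch of how a corpus name can map to a list of member datasets, in the spirit of data.py; every member list below is a placeholder, since the real definitions live in the repo:

```python
# Hypothetical corpus registry. The corpus names come from this thread;
# the member lists are placeholders, NOT the actual contents of data.py.
available_corpus = dict(
    webvid10m_cc3m=["webvid_10m", "cc3m"],         # Mistral Stage 2 corpus
    webvid10m_cc14m_plus=["webvid_10m", "cc14m"],  # Vicuna Stage 2 corpus
                                                   # ("_plus" extras omitted)
)
print(available_corpus["webvid10m_cc3m"])
```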

mcrchopra commented 1 month ago

Thank you for the quick response! This is really helpful!