Closed mcrchopra closed 1 month ago
Good questions!
The `num_frames` in the configs are actually placeholders, and I changed them to 4 in the ipynb. Since UMT uses sin-cos position embeddings and is pretrained with 4 frames, I set it to 4 for better position interpolation.
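For intuition, here is a minimal sketch of interpolating a temporal position embedding from the 4 pretraining frames up to a larger frame count at inference time. This is not the repo's actual code; the function name, linear interpolation, and shapes are illustrative assumptions.

```python
import numpy as np

def interpolate_temporal_pos_embed(pos_embed, new_num_frames):
    """Linearly interpolate a (num_frames, dim) temporal position
    embedding to a new number of frames.

    Hypothetical helper for illustration; the repo has its own
    interpolation logic for its sin-cos embeddings.
    """
    old_num_frames, dim = pos_embed.shape
    old_grid = np.linspace(0.0, 1.0, old_num_frames)
    new_grid = np.linspace(0.0, 1.0, new_num_frames)
    # Interpolate each embedding dimension independently along time.
    return np.stack(
        [np.interp(new_grid, old_grid, pos_embed[:, d]) for d in range(dim)],
        axis=1,
    )

# Pretrained with 4 frames; expanded to 16 frames for inference.
pretrained = np.random.randn(4, 768)
expanded = interpolate_temporal_pos_embed(pretrained, 16)
print(expanded.shape)  # (16, 768)
```

Keeping the training `num_frames` close to the pretraining value (4) means the embedding grid needs less stretching, which is the "better position interpolation" point above.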
In instruction tuning we use 8 frames. We find 16 frames perform better on some tasks, but we stick with 8 frames as a balance between efficiency and effectiveness.
You can find the data details in data.py. We use only WebVid10M and CC3M in stage 2 of Mistral, since we find that adding more data is actually harmful, leading to about a 1-2% accuracy drop on MVBench.
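To make the data point concrete, a stage-2 mix restricted to those two corpora might be declared like this. This is a hypothetical sketch: the real corpus definitions live in data.py, and these keys and fields are illustrative, not the actual ones.

```python
# Hypothetical corpus registry; see data.py in the repo for the
# real definitions and annotation paths.
available_corpus = {
    "webvid_10m": {"media_type": "video", "approx_samples": 10_000_000},
    "cc3m": {"media_type": "image", "approx_samples": 3_000_000},
}

# Stage 2 of the Mistral variant trains on only these two corpora;
# per the maintainers, adding more data cost ~1-2% accuracy on MVBench.
stage2_mistral_train_corpora = ["webvid_10m", "cc3m"]
```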
Thank you for the quick response! This is really helpful!
Hi VideoChat2 Team,
I'm looking at the Stage 2 configs for both the Vicuna and Mistral variants. I had a few quick questions:

1. Why is `num_frames` different between the two configs? Mistral uses `num_frames=4` and Vicuna uses `num_frames=8`.
2. In `demo_mistral.ipynb`, `num_frames` is set to 16, even though at train time only 4 frames were used. Why does this discrepancy exist between train and inference time? Why not just set `num_frames` to 16 during train time as well?