OpenGVLab / Ask-Anything

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
https://vchat.opengvlab.com/
MIT License
3.03k stars, 248 forks

Question regarding stage 4 HD image size #244

Open jpan72 opened 1 day ago

jpan72 commented 1 day ago

Hello,

Thank you for the great work!

For stage 4 (instruction tuning with HD data), the current code seems to resize/crop each image to 224x224: https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/scripts/videochat_mistral/config_7b_hd_stage4.py#L21 https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/dataset/__init__.py#L73

That would mean training actually uses 224x224 frames. Is that true? If so, what does "HD" refer to? Or did I miss something?

Thank you!

yinanhe commented 1 day ago

224 is the input resolution of our vision encoder. For HD, please refer to the dynamic resolution setting in the stage 4 config: https://github.com/OpenGVLab/Ask-Anything/blob/c3f07988b1db77ed24d706650d3cb23e3495a011/video_chat2/scripts/videochat_mistral/config_7b_hd_stage4.py#L85-L90
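
For readers wondering how a 224 encoder input can coexist with "HD" training, here is a minimal sketch of one common dynamic-resolution scheme. It is not taken from the repository: it assumes the frame is resized to a grid of 224x224 tiles whose aspect ratio is closest to the original, and that each tile is then passed to the encoder separately. The names `pick_grid`, `tile_boxes`, and `max_tiles` are hypothetical, and the actual config options at the linked lines may differ.

```python
from typing import List, Tuple

ENCODER_SIZE = 224  # input resolution of the vision encoder, as stated in the reply


def pick_grid(width: int, height: int, max_tiles: int = 6) -> Tuple[int, int]:
    """Return (cols, rows) whose aspect ratio is closest to width / height.

    Hypothetical helper: max_tiles and the tie-breaking rule are assumptions,
    not values read from the repository.
    """
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Closest aspect ratio wins; on a tie, prefer the grid with more tiles.
    return min(
        candidates,
        key=lambda g: (abs(g[0] / g[1] - target), -(g[0] * g[1])),
    )


def tile_boxes(
    cols: int, rows: int, size: int = ENCODER_SIZE
) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) for each tile of the resized frame."""
    return [
        (c * size, r * size, (c + 1) * size, (r + 1) * size)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    cols, rows = pick_grid(1280, 720)
    # The frame would be resized to (cols * 224, rows * 224) before tiling.
    print("grid:", (cols, rows))
    print("resized frame:", (cols * ENCODER_SIZE, rows * ENCODER_SIZE))
    print("tiles:", tile_boxes(cols, rows))
```

Under this assumption every tile the encoder sees is still 224x224, but the frame as a whole is represented at a higher effective resolution, which would explain why a 224 crop size in the config does not contradict the "HD" label.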