Open jpan72 opened 1 day ago
224 is the input resolution of our vision encoder. For HD, please refer to the dynamic resolution settings here: https://github.com/OpenGVLab/Ask-Anything/blob/c3f07988b1db77ed24d706650d3cb23e3495a011/video_chat2/scripts/videochat_mistral/config_7b_hd_stage4.py#L85-L90
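To illustrate how a 224x224 encoder and "HD" inputs can coexist, here is a rough sketch of one common dynamic-resolution scheme: pick a grid of 224x224 tiles whose aspect ratio best matches the frame, resize the frame to that grid, and feed each tile to the encoder. This is not taken from the repository; `choose_grid`, `tile_boxes`, and the `max_tiles=6` limit are hypothetical names and values used only for illustration, and the linked config may work differently.

```python
from itertools import product

TILE = 224  # input resolution of the vision encoder, as stated in the reply


def choose_grid(width, height, max_tiles=6):
    """Return (cols, rows) of TILE-sized tiles closest to the frame's aspect ratio.

    max_tiles is a hypothetical cap on the number of tiles per frame.
    """
    target = width / height
    candidates = [
        (c, r)
        for c, r in product(range(1, max_tiles + 1), repeat=2)
        if c * r <= max_tiles
    ]
    # Closest aspect ratio wins; ties go to the grid with more tiles.
    return min(candidates, key=lambda g: (abs(g[0] / g[1] - target), -g[0] * g[1]))


def tile_boxes(cols, rows, tile=TILE):
    """Return (left, top, right, bottom) crop boxes in row-major order.

    The boxes assume the frame was first resized to (cols * tile, rows * tile).
    """
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


cols, rows = choose_grid(1280, 720)
print((cols, rows), tile_boxes(cols, rows))
```

Under this scheme each tile is still 224x224, so the encoder input size never changes; the extra detail comes from the frame being resized to a multiple of 224 and split into several tiles.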
Hello,
Thank you for the great work!
For stage 4 (instruction tuning with HD data), the current code seems to resize/crop each image to 224x224: https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/scripts/videochat_mistral/config_7b_hd_stage4.py#L21 https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/dataset/__init__.py#L73
This would mean that training actually uses 224x224 frames. Is that true? If so, what does "HD" refer to? Or did I miss something?
Thank you!