Questions about how to enlarge the base vision tower input resolution

dvlab-research / MGM

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

Apache License 2.0


Questions about how to enlarge the base vision tower input resolution #48

Open lucasjinreal opened 6 months ago

lucasjinreal commented 6 months ago

Currently, there is an option to use a two-image grid to double the input resolution, but this introduces fairly heavy compute.

I just want to make the input resolution slightly larger, say from 336 to 448, while keeping the ConvNeXt input resolution the same (although I think it should currently also be larger when the base vision tower's input is larger).

Would that be possible? Can you give me some advice on how to adapt it?

yanwei-li commented 6 months ago

Hi, of course, it is possible. The main question is how to split the input image. A simple solution is to divide the 448x448 image into four overlapping 336x336 crops. However, the computational cost is then actually the same as that of 672x672. In this case, I guess you can try using four 224x224 tiles to represent 448, which brings a lower cost but a slightly larger resolution.
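To make the cost comparison above concrete, here is a minimal sketch of the two splitting options. It is not code from the MGM repo: the helper names `grid_crop_boxes` and `vit_tokens` are hypothetical, and the patch size of 14 is an assumption (the thread does not state it).

```python
def grid_crop_boxes(image_size, crop_size, grid=2):
    """Return (left, top, right, bottom) boxes for a grid x grid split.

    Crops are spaced evenly across the image, so they overlap
    whenever grid * crop_size > image_size.
    """
    if grid < 2:
        raise ValueError("grid must be at least 2")
    if crop_size > image_size or grid * crop_size < image_size:
        raise ValueError("crops must fit inside the image and cover it")
    span = image_size - crop_size
    # Evenly spaced offsets; the last one always equals image_size - crop_size.
    offsets = [i * span // (grid - 1) for i in range(grid)]
    return [
        (left, top, left + crop_size, top + crop_size)
        for top in offsets
        for left in offsets
    ]


def vit_tokens(input_size, patch_size=14):
    """Patch tokens for a square input (class token excluded).

    patch_size=14 is an assumption, not a value given in the thread.
    """
    return (input_size // patch_size) ** 2


# Option 1: four overlapping 336x336 crops of a 448x448 image.
overlapping = grid_crop_boxes(448, 336)
# Option 2: four non-overlapping 224x224 tiles of the same image.
tiles = grid_crop_boxes(448, 224)

print(4 * vit_tokens(336), vit_tokens(672))  # 2304 2304
print(4 * vit_tokens(224), vit_tokens(336))  # 1024 576
```

Under these assumptions, the four 336x336 crops produce as many patch tokens as a single 672x672 input, while the four 224x224 tiles produce 1024 tokens, compared with 576 for a single 336x336 input.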

lucasjinreal commented 6 months ago

Hi, how about directly making the input 448 and then interpolating for the CLIP ViT? That would add more visual tokens. Is that possible, and would it enhance performance?

yanwei-li commented 6 months ago

Hi, if we directly enlarge the input to 448, we need to fine-tune the ViT model. Without a large amount of data, this could lead to inferior performance.
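For readers unfamiliar with the interpolation idea raised above, the following is a rough, dependency-free sketch of resizing a ViT's grid of position embeddings so the model can accept more patches. It assumes that "interpolate" refers to the position embeddings and that the patch size is 14 (so 336 gives a 24x24 grid and 448 gives 32x32); neither is stated in the thread. The function name `resize_pos_embed` is hypothetical, and in a PyTorch codebase one would more likely call `torch.nn.functional.interpolate` on the embedding tensor. As noted above, the resized embeddings would still need fine-tuning.

```python
def resize_pos_embed(grid, new_size):
    """Bilinearly resize a square grid of position embeddings.

    grid: old_size x old_size nested list; each cell is a list of floats.
    Returns a new_size x new_size nested list. Corner cells map to corner
    cells (align-corners style sampling).
    """
    old_size = len(grid)
    dim = len(grid[0][0])
    scale = (old_size - 1) / (new_size - 1) if new_size > 1 else 0.0
    out = []
    for i in range(new_size):
        y = i * scale
        y0 = min(int(y), old_size - 1)
        y1 = min(y0 + 1, old_size - 1)
        wy = y - y0
        row = []
        for j in range(new_size):
            x = j * scale
            x0 = min(int(x), old_size - 1)
            x1 = min(x0 + 1, old_size - 1)
            wx = x - x0
            # Blend the four neighbouring embeddings, one dimension at a time.
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ])
        out.append(row)
    return out


# Toy 2-dimensional embeddings on a 24x24 grid (336 / 14, assumed patch size).
old = [[[float(r), float(c)] for c in range(24)] for r in range(24)]
# Resize to a 32x32 grid (448 / 14).
new = resize_pos_embed(old, 32)
print(len(new), len(new[0]), len(new) * len(new[0]))  # 32 32 1024
```

The token count grows from 576 to 1024, but the interpolated embeddings are values the ViT never saw during pretraining, which is one way to read the fine-tuning caveat in the reply above.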