Open lucasjinreal opened 6 months ago
Hi, of course, it is possible. The main question is how to split the input image. A simple solution is to divide the 448x448 image into four overlapping 336x336 crops. However, the computational cost is then actually the same as that of a 672x672 input, since four 336x336 crops contain as many pixels as one 672x672 image. In this case, I guess you can try using four 224x224 crops to represent the 448x448 image, which keeps the cost low while giving a slightly larger resolution.
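To make the cost comparison concrete, here is a minimal sketch of the 2x2 split and the resulting token counts. The helper names `tile_boxes` and `num_tokens` are hypothetical (not from the repo), and a patch size of 14 is an assumption based on the CLIP ViT-L/14 family.

```python
def tile_boxes(size, tile, grid=2):
    """Return (left, top, right, bottom) crop boxes for a grid x grid split.

    Tiles overlap whenever grid * tile > size.
    """
    # Spread the tiles evenly so the last one ends at the image border.
    stride = (size - tile) // (grid - 1) if grid > 1 else 0
    boxes = []
    for row in range(grid):
        for col in range(grid):
            left, top = col * stride, row * stride
            boxes.append((left, top, left + tile, top + tile))
    return boxes


def num_tokens(side, patch=14):
    """Patch tokens for a square input (patch=14 is an assumption)."""
    return (side // patch) ** 2


# Four overlapping 336x336 crops of a 448x448 image.
print(tile_boxes(448, 336))
# Four 336 crops produce as many tokens as one 672 input.
print(4 * num_tokens(336), num_tokens(672))
# Four 224 crops produce as many tokens as one 448 input.
print(4 * num_tokens(224), num_tokens(448))
```

Under these assumptions, the four 224x224 crops tile the 448x448 image exactly (no overlap), which is why that variant stays cheap.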
Hi, how about directly making the input 448 and then interpolating for the CLIP ViT? That would add more visual tokens. Is that possible, and would it enhance performance?
Hi, if we directly enlarge the input to 448, we need to fine-tune the ViT model. Without a large amount of data, that could lead to inferior performance.
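For reference, here is a rough sketch of what interpolating for a larger input could look like, assuming the interpolation in question is applied to the ViT position embeddings. The function `resize_pos_embed` is hypothetical, the layout (class token first, then a square patch grid) and patch size 14 are assumptions, and bilinear resampling stands in for whatever method a real implementation would choose. It only resizes the embedding table; the ViT would still need fine-tuning, as noted above.

```python
import numpy as np


def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize a (1 + old_grid**2, dim) position-embedding table.

    Assumes row 0 is the class token and the rest is a row-major patch grid.
    """
    cls_tok, patch = pos_embed[:1], pos_embed[1:]
    dim = patch.shape[1]
    grid = patch.reshape(old_grid, old_grid, dim)

    # Sample positions in old-grid index space, with corners aligned.
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    w = coords - lo

    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - w)[:, None, None] + grid[hi] * w[:, None, None]
    out = rows[:, lo] * (1 - w)[None, :, None] + rows[:, hi] * w[None, :, None]

    return np.concatenate([cls_tok, out.reshape(new_grid * new_grid, dim)], axis=0)


# 336 / 14 = 24 patches per side, 448 / 14 = 32 patches per side.
old = np.random.rand(1 + 24 * 24, 8)
new = resize_pos_embed(old, 24, 32)
print(new.shape)  # one class token plus 32 * 32 patch positions
```

Going from 336 to 448 this way raises the patch-token count from 576 to 1024, which is where the extra visual tokens come from.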
Currently, there is an option to use a two-image grid to double the input, but this introduces fairly heavy compute.
I just want to make the input resolution slightly larger, say from 336 -> 448, while keeping the ConvNeXt input resolution the same (although I think it should currently be larger if the base vision tower is larger).
Could that be possible? Can you give me some advice on how to adopt it?