WalBouss / GEM

[CVPR24] Official Implementation of GEM (Grounding Everything Module)
MIT License

Questions about the Image Resolution #6

Closed tsunghan-wu closed 1 month ago

tsunghan-wu commented 3 months ago

Hey guys,

Thanks for the great work. I have a question about the function gem.get_gem_img_transform(). To the best of my knowledge, some pretrained CLIP-like models use a native resolution of 224x224 or 336x336 rather than 448x448. Should we pass the model's native preprocessing resolution to this function when using GEM, or is it fine to just use the default 448x448?

Thanks and looking forward to the reply!

Best, Patrick

WalBouss commented 1 month ago

Hi, thanks for your interest in our work, and sorry for the late reply. You're right: most VLMs have a native resolution of 224x224 or 336x336. However, we found that for most segmentation datasets we get better performance if we interpolate the positional encoding so that the ViT can process higher-resolution images. Specifically, we use bicubic interpolation of the positional encoding so that the model can process 448x448 inputs. You could also use any other resolution, though we found that 448x448 was the limit, beyond which we saw no further improvement.
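For readers who want to see what bicubic interpolation of a positional encoding can look like, here is a minimal PyTorch sketch. It is not GEM's actual code: the helper name `resize_pos_embed`, the `(1, 1 + grid*grid, dim)` layout with a single leading class token, the patch size of 16, and the embedding width of 768 are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int,
                     num_prefix_tokens: int = 1) -> torch.Tensor:
    """Bicubically resize ViT positional embeddings.

    Assumed layout (hypothetical, not taken from GEM):
    pos_embed has shape (1, num_prefix_tokens + g*g, dim).
    """
    # Class (prefix) tokens have no spatial position, so keep them unchanged.
    prefix = pos_embed[:, :num_prefix_tokens]
    patch = pos_embed[:, num_prefix_tokens:]

    old_grid = int(round(patch.shape[1] ** 0.5))
    assert old_grid * old_grid == patch.shape[1], "patch tokens must form a square grid"
    dim = patch.shape[-1]

    # (1, g*g, dim) -> (1, dim, g, g) so the grid can be resized like an image.
    patch = patch.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch = F.interpolate(patch, size=(new_grid, new_grid),
                          mode="bicubic", align_corners=False)
    # Back to (1, new_grid*new_grid, dim).
    patch = patch.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

    return torch.cat([prefix, patch], dim=1)


patch_size = 16  # assumed ViT-B/16-style patching
native = torch.randn(1, 1 + (224 // patch_size) ** 2, 768)  # 14x14 grid + class token
resized = resize_pos_embed(native, new_grid=448 // patch_size)  # 28x28 grid
print(resized.shape)  # torch.Size([1, 785, 768])
```

With a patch size of 16, going from 224x224 to 448x448 grows the patch grid from 14x14 to 28x28, so the model needs 784 patch position embeddings instead of 196. Interpolation produces them from the pretrained ones without any retraining.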

tsunghan-wu commented 1 month ago

I see. Thanks for the clarification!