Hi, thanks for your interest in our work, and sorry for the late reply. You're right that most VLMs have a native resolution of 224x224 or 336x336. However, we found that for most segmentation datasets we get better performance if we interpolate the positional encoding to let the ViT process higher-resolution images. In that case, we use bicubic interpolation of the positional encoding so that the model can process 448x448 inputs, but you could also use any other resolution (though we found that 448x448 was the limit, beyond which we don't see any improvement).
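For reference, here is a minimal sketch of how such an interpolation can be done (illustrative only, assuming a standard ViT positional embedding with a leading class token and a square patch grid; this is not necessarily the exact code used in GEM):

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_size: int, patch_size: int = 16) -> torch.Tensor:
    """Bicubic-interpolate ViT positional embeddings to a new image resolution.

    pos_embed: (1, 1 + old_grid**2, dim) tensor with a leading class token.
    new_size:  target image resolution (e.g. 448), assumed square.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)   # e.g. 14 for 224 / 16
    new_grid = new_size // patch_size           # e.g. 28 for 448 / 16

    # (1, N, dim) -> (1, dim, old_grid, old_grid) so we can interpolate spatially
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # back to (1, new_grid**2, dim) and re-attach the class token
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# Example: resize a 224x224 (14x14 grid) embedding to 448x448 (28x28 grid)
pos = torch.randn(1, 1 + 14 * 14, 768)
print(interpolate_pos_embed(pos, new_size=448).shape)  # torch.Size([1, 785, 768])
```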
I see. Thanks for the clarification!
Hey guys,
Thanks for the great work. I was wondering about the function `gem.get_gem_img_transform()`. To the best of my knowledge, some pretrained CLIP-style models use 224x224 or 336x336, not 448x448. Should we pass the correct preprocessed image resolution to this function when we use the GEM model, or is it fine to just use the default 448x448? Thanks, and looking forward to your reply!
Best, Patrick