Hi @NeilHnxTcc @windygoo,
Thanks for your interest.
For high-resolution images (e.g., 1024 x 1024), existing work mainly uses conv-based backbones, since they are more flexible with respect to input resolution. If you use a ViT-based CLIP backbone, which is kept frozen in our method, there would be a strong misalignment between training and inference: the ViT-based backbone was pre-trained on far fewer patches than it would see at our inference resolution. So we follow previous works [1][2] and use conv-based CLIP backbones.
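To make the mismatch concrete, a rough back-of-the-envelope sketch (the 224px pre-training resolution and 16px patch size are typical ViT-B/16 CLIP values, assumed here purely for illustration):

```python
# Illustrative only: patch-count gap between typical ViT-CLIP pre-training
# and 1024 x 1024 dense-prediction inference (assumed ViT-B/16-style values).
patch_size = 16
pretrain_res, infer_res = 224, 1024

pretrain_patches = (pretrain_res // patch_size) ** 2   # 14 * 14 = 196
infer_patches = (infer_res // patch_size) ** 2         # 64 * 64 = 4096
print(pretrain_patches, infer_patches)                 # 196 4096
```

A frozen ViT would therefore have to process roughly 20x more tokens than it ever saw during pre-training, whereas a conv backbone simply produces a proportionally larger feature map.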
Exploring frozen ViT-based CLIP backbones for open-vocabulary dense prediction may be a promising research direction.
Please let me know if you have any other questions.
[1] Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP.
[2] F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models.
Thanks for your great work. I notice that you resize input images to (1024, 1024), while the official conv-based CLIP is pre-trained at a resolution of 256px/320px. Does this mean that you directly feed the high-resolution image into the conv-based CLIP and extract its features?
Exactly.
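For anyone reading along, here is a minimal sketch of what feeding a 1024 x 1024 image directly into a frozen conv-based CLIP and extracting dense features can look like. This is not the repo's actual code; the open_clip model name, pretrained tag, and the `.visual.trunk` attribute path are assumptions based on common ConvNeXt-CLIP checkpoints.

```python
# Minimal sketch (assumptions: open_clip ConvNeXt-CLIP checkpoint and its
# timm trunk layout; not the repo's actual feature extractor).
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "convnext_large_d_320", pretrained="laion2b_s29b_b131k_ft_soup"
)
model.eval()
for p in model.parameters():      # freeze the whole CLIP backbone
    p.requires_grad_(False)

image = torch.randn(1, 3, 1024, 1024)  # much larger than the 320px pre-training size

with torch.no_grad():
    x = model.visual.trunk.stem(image)          # stride-4 stem
    multi_scale_feats = []
    for stage in model.visual.trunk.stages:     # ConvNeXt stages: strides 4/8/16/32
        x = stage(x)
        multi_scale_feats.append(x)

for f in multi_scale_feats:
    print(f.shape)  # spatial sizes 256/128/64/32 for a 1024 x 1024 input
```

Because every layer is convolutional, the frozen backbone simply yields larger feature maps at 1024 x 1024; there is no fixed positional-embedding grid to interpolate as there would be with a ViT.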
As stated in the title, thank you for your attention and help!