HarborYuan / ovsam

[ECCV 2024] The official code of paper "Open-Vocabulary SAM".
https://www.mmlab-ntu.com/project/ovsam

Why don't you use clip-ViT as the backbone? #11

Closed. NeilHnxTcc closed this issue 10 months ago.

NeilHnxTcc commented 10 months ago

As the title says, thank you for your attention and help!

HarborYuan commented 10 months ago

Hi @NeilHnxTcc @windygoo,

Thanks for your interest.

For high-resolution images (e.g., 1024 x 1024), existing work mainly uses conv-based backbones because they are more flexible with respect to input resolution. With a ViT-based CLIP backbone, which is frozen in our method, there would be a strong misalignment between training and inference: during training, the ViT-based backbone sees far fewer patches than it does at inference. So we follow previous works [1][2] and use conv-based CLIP backbones.

Exploring frozen ViT-based CLIP backbones for open-vocabulary dense prediction may be a promising research direction.

Please let me know if you have any other questions.

[1] Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP.
[2] F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models.
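To make the resolution argument concrete, here is a minimal toy sketch in plain PyTorch (not code from this repo): a conv stem simply yields a larger feature map at a higher input resolution, while a ViT-style patch embedding yields a different number of tokens, so positional embeddings frozen at the pre-training resolution no longer line up.

```python
import torch
import torch.nn as nn

# Toy stand-ins, not the OVSAM backbones:
conv_stem = nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3)  # conv-CLIP-like stem
patch_embed = nn.Conv2d(3, 64, kernel_size=16, stride=16)         # ViT-like 16x16 patchify

for size in (320, 1024):
    x = torch.randn(1, 3, size, size)
    fmap = conv_stem(x)                               # spatial map scales with the input
    num_tokens = patch_embed(x).flatten(2).shape[-1]  # ViT token count at this resolution
    print(size, tuple(fmap.shape[-2:]), num_tokens)
# 320 px  -> 400 tokens for a 16x16-patch ViT
# 1024 px -> 4096 tokens, so positional embeddings learned at low resolution
#            would have to be interpolated for a frozen ViT backbone.
```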

SxJyJay commented 10 months ago

Thanks for your great work. I notice that you resize input images to 1024 x 1024, while the official conv-based CLIP is pre-trained at a resolution of 256 px / 320 px. Does this mean that you directly feed the high-resolution image into the conv-based CLIP and extract its features?

HarborYuan commented 10 months ago

Exactly.
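As a concrete illustration of feeding the high-resolution image directly into a conv-based CLIP, here is a hedged sketch using the open_clip library. This is not the OVSAM code; the model name, pretrained tag, and the `model.visual.trunk` attribute are assumptions about the open_clip/timm ConvNeXt wrapper and should be checked against `open_clip.list_pretrained()`.

```python
import torch
import open_clip  # pip install open_clip_torch

# Example ConvNeXt-based CLIP pre-trained around 320 px (names are assumptions).
model, _, _ = open_clip.create_model_and_transforms(
    'convnext_large_d_320', pretrained='laion2b_s29b_b131k_ft_soup')
model.eval().requires_grad_(False)  # frozen backbone, as in the paper's setting

x = torch.randn(1, 3, 1024, 1024)   # high-resolution input, no resizing down to 320 px
with torch.no_grad():
    # Assumed layout: open_clip wraps the timm ConvNeXt as model.visual.trunk;
    # forward_features keeps the spatial map instead of pooling to one embedding.
    feats = model.visual.trunk.forward_features(x)
print(feats.shape)  # roughly (1, C, 32, 32): just a larger grid than at 320 px
```

Because the backbone is fully convolutional, no positional-embedding interpolation is needed; the frozen weights simply produce a larger feature grid than the one seen during CLIP pre-training.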