Mamduh-k closed this issue 2 years ago.
CLIP may contain information about unseen classes, but only at the image level; no pixel-level information about unseen classes is available during training. Besides, CLIP does not use any human-annotated labels. In fact, using CLIP models is a relaxed setting, but it has more practical value than the strict zero-shot semantic segmentation setting, and it is similar to the setting of open-vocabulary object detection [1, 2]. In addition, the proposed decoupling framework also improves the models under the strict zero-shot semantic segmentation setting. A minimal sketch of the test-time mask classification under discussion is given after the references.
[1] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In CVPR, 2021.
[2] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Zero-shot detection via vision and language knowledge distillation. arXiv, 2021.
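To make the setting concrete, here is a minimal, self-contained sketch of what classifying class-agnostic mask proposals with CLIP at test time can look like. This is not the repository's exact implementation: it assumes the `clip` package from https://github.com/openai/CLIP, a ViT-B/32 checkpoint, and a hypothetical list of binary `mask_proposals` coming from a class-agnostic segmenter; the helper `classify_masks` is illustrative only.

```python
# Sketch only: label class-agnostic binary masks with CLIP at test time.
# Assumes the openai `clip` package; `mask_proposals` (HxW boolean arrays) and
# `class_names` (which may include unseen classes) are hypothetical inputs.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_masks(image: Image.Image, mask_proposals, class_names):
    """Assign each class-agnostic binary mask a label from `class_names`."""
    # Encode the candidate class names (seen + unseen) with CLIP's text encoder.
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(prompts)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    img = np.asarray(image)
    labels = []
    for mask in mask_proposals:
        # Crop to the mask's bounding box and blank out background pixels,
        # so CLIP sees (roughly) only the proposed segment.
        ys, xs = np.where(mask)
        if len(ys) == 0:
            labels.append(None)
            continue
        crop = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
        crop_mask = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        crop[~crop_mask] = 0
        pixels = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)

        with torch.no_grad():
            img_feat = model.encode_image(pixels)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

        # Cosine similarity between the masked crop and every class prompt;
        # the highest-scoring class name becomes the mask's label.
        sims = (img_feat @ text_feat.T).squeeze(0)
        labels.append(class_names[sims.argmax().item()])
    return labels
```

Note how this separates the two roles: the mask shapes come from a segmenter trained only with seen-class pixel labels, while CLIP contributes only image-level knowledge to name each mask, which is the point made above.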
I hope I have explained this clearly. Feel free to reopen this issue.
Hello, author. You directly use the CLIP model to classify the class-agnostic binary masks during the testing phase. This seems to violate the principle of zero-shot learning, because CLIP already contains information about the unseen classes.