Closed fushh closed 8 months ago
What you are referring to should be the zero-shot setting. This paper is about the open-vocabulary setting. Open-vocabulary object detection differs from zero-shot object detection in that it can access large-scale novel objects with weakly-supervised labels, e.g., tags and captions. For example, Detic uses COCO captions (which cover all 80 categories) but does not touch any novel object's bounding-box annotations.
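To make the distinction concrete, here is a minimal, hypothetical sketch of the data setting described above: base categories keep full box annotations, while novel categories are reduced to image-level tags (the kind of weak supervision captions provide). The category names and the base/novel split below are illustrative only, not the actual COCO OVD split.

```python
# Illustrative base/novel split (NOT the real COCO OVD split).
BASE = {"person", "car"}            # boxes may be used for training
NOVEL = {"umbrella", "skateboard"}  # only weak image-level labels allowed

def to_open_vocab(sample):
    """Drop box annotations for novel categories, keeping only tags."""
    boxes = [a for a in sample["annotations"] if a["category"] in BASE]
    tags = sorted({a["category"] for a in sample["annotations"]})
    return {"image_id": sample["image_id"], "annotations": boxes, "tags": tags}

sample = {
    "image_id": 1,
    "annotations": [
        {"category": "person", "bbox": [10, 10, 50, 80]},
        {"category": "umbrella", "bbox": [5, 5, 30, 30]},
    ],
}
out = to_open_vocab(sample)
# The novel "umbrella" box is removed, but its tag survives as weak supervision.
```

The point is that the detector never sees a box for "umbrella"; it only knows the image contains one, which is exactly the signal a caption or tag list can provide.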
Thank you for reading our paper. Please kindly refer to "A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future" (https://arxiv.org/abs/2307.09220) for more details on the zero-shot and open-vocabulary settings.
In Table 1, WSOVOD trained with COCO image-level labels achieves 35.0 APn. It seems that the authors mistakenly use all 80 categories as the image-level labels, which breaks the open-vocabulary setting. But I am not sure whether my understanding is correct; if it is not, please point it out.