V3Det / Detectron2-V3Det

Detectron2 Toolbox and Benchmark for V3Det
Apache License 2.0

Question about step 2 of Open-Vocabulary Detection #5

Open HEasoner opened 5 months ago

HEasoner commented 5 months ago

Hi @yhcao6, when I run the second step of Open-Vocabulary Detection, `python tools/v3det_ovd_utils/split_base_novel.py datasets/V3Det/annotations/v3det_2023_v1_train.json`, I found that this step collects 100 ImageNet images for each V3Det novel category:

```python
image = {
    'id': count,
    'file_name': file_name,
    'pos_category_ids': [cat_id],
    'width': w,
    'height': h
}
count = count + 1
cat_images.append(image)
cls_cnt += 1
if cls_cnt > 100:
    break
```

Maybe I am wrong, but the challenge states that an exemplar object-centric image is available for each category, yet here there are 100 images. If the single exemplar image is somewhere else, what is the purpose of these 100 pictures, and what is the purpose of step 2?
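For context, the capping logic quoted above can be sketched as a self-contained loop. The function name and toy inputs below are my own illustration, not the actual `split_base_novel.py` code; the break condition is reproduced as quoted:

```python
def select_images_per_category(files_by_cat, cap=100):
    """Build image records, capping each category as in the quoted snippet.

    files_by_cat: dict mapping category id -> list of file names.
    Note: `cls_cnt > cap` breaks only after appending, so each category
    actually keeps cap + 1 images, matching the quoted code's behavior.
    """
    images, count = [], 0
    for cat_id, file_names in files_by_cat.items():
        cls_cnt = 0
        for file_name in file_names:
            images.append({
                'id': count,
                'file_name': file_name,
                'pos_category_ids': [cat_id],
            })
            count += 1
            cls_cnt += 1
            if cls_cnt > cap:
                break
    return images
```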

yhcao6 commented 5 months ago

The ImageNet images here are not what we mean by "exemplar images." To see what we mean by exemplar images, please check this link. For every novel class we provide a specific exemplar image. It is important not to manually label these exemplar images or use them as training data.

You can use images from other public and academic sources for your work, including different kinds of datasets such as object detection, image classification (like ImageNet), and image-text datasets. Essentially, you are free to use as many images as you need from these sources as long as they meet the criteria mentioned above (we select the top 100 images here only to build the baseline). However, it is important to explain how you used these datasets in your technical report.

Please refer to the challenge page (2.a) for more details.

twangnh commented 5 months ago

Hi @yhcao6, thanks for hosting the challenge. However, the setting is open-vocabulary object detection: if we use other sources of data that contain the novel categories, how do we make sure this does not violate the open-vocabulary setting? For example, if we simply use the Bamboo images for the novel categories during training, that may lead to significant improvement on novel categories.

yhcao6 commented 5 months ago

In open-vocabulary object detection, you can use data of novel classes in the form of image-text pairs or classification images (treating the image label as a unique caption). The important thing is that you are not allowed to use data of novel classes with manually labeled bounding boxes. Here are some OVD papers:

  1. Detic utilizes classification data (ImageNet): https://arxiv.org/abs/2201.02605
  2. OVR-CNN utilizes image-text pair data: https://arxiv.org/abs/2011.10678
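As a concrete illustration of how classification images can supervise a detector without boxes, Detic (arXiv:2201.02605) applies the classification loss to the largest proposal of an image that carries only an image-level label. Below is a minimal pure-Python sketch of that idea; the function name, input shapes, and toy proposals are my assumptions, not Detic's actual code:

```python
import math

def max_size_image_label_loss(proposals, image_label):
    """Detic-style weak supervision sketch: cross-entropy on the
    class logits of the largest-area proposal.

    proposals: list of (box, logits), box = (x1, y1, x2, y2),
               logits = list of per-class scores.
    image_label: the image-level class index (no box annotation).
    """
    def area(box):
        x1, y1, x2, y2 = box
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)

    # Pick the proposal with the largest area; Detic's intuition is
    # that the image-level label most likely describes the biggest object.
    _, logits = max(proposals, key=lambda p: area(p[0]))

    # Numerically stable cross-entropy: log-sum-exp minus the target logit.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[image_label]
```

A detector trained this way never sees a manually drawn box for the novel class, which is why this form of classification-image supervision is allowed under the rules above.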
twangnh commented 5 months ago

@yhcao6 thanks for the reply. We note that prior OVD methods did not explicitly use images of novel categories; Detic's use of image classification data is a weakly-supervised form of object detection, and the novel classes are known in advance, which may not strictly evaluate OPEN-vocabulary detection capability. We thus believe it is better to separate the settings where such explicit use of image classification data for novel classes is and is not allowed. For example: setting 1, the novel classes are not known in advance; setting 2, the novel classes are known during training and an unrestricted number of images per class can be used.