facebookresearch / CutLER

Code release for "Cut and Learn for Unsupervised Object Detection and Instance Segmentation" and "VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation"

Why use ImageNet to pretrain CutLER #39

Closed chos1npc closed 1 year ago

chos1npc commented 1 year ago

During training, multiple masks are generated per image. However, in ImageNet there is only one object per image, so how can this training approach succeed? Shouldn't we use the COCO dataset, whose images contain multiple objects?

frank-xwang commented 1 year ago

Hey! We use ImageNet because:

  1. MaskCut detects an average of 2-3 objects per image on ImageNet, since many images in the dataset actually contain multiple objects (see the sketch after this list).
  2. We also chose ImageNet to stay consistent with previous unsupervised representation learning work, and to show that training CutLER on ImageNet alone, without any supervision, is enough for it to excel at challenging detection and instance segmentation tasks, with no training on dedicated detection datasets.
  3. While MSCOCO contains more objects per image, it has far fewer images than ImageNet and is less diverse. Consequently, training on MSCOCO alone yields worse performance on zero-shot detection and segmentation. In this work, we use MSCOCO as the test set for zero-shot unsupervised object detection.
  4. We also ran an ablation that trains the model on the YFCC dataset, which gives performance comparable to training CutLER on ImageNet. Please check Table 11 for more details.
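
For reference, here is a minimal sketch (not the repository code) of how MaskCut can produce several masks from a single ImageNet image: it repeatedly solves a Normalized Cut on a patch-affinity graph built from self-supervised ViT features, removing the patches of each discovered object before the next round. The function name `maskcut_sketch`, the `tau` threshold value, and the simple mean-based foreground choice are illustrative assumptions, not the released implementation.

```python
import numpy as np
from scipy.linalg import eigh

def maskcut_sketch(patch_features, n_masks=3, tau=0.15):
    """Illustrative sketch of the MaskCut idea (not the official code).

    patch_features: (num_patches, dim) array of self-supervised ViT
    (e.g., DINO) patch embeddings for one image.
    Returns up to `n_masks` boolean patch-level masks.
    """
    # Unit-normalize features so the affinity is cosine similarity.
    feats = patch_features / np.linalg.norm(patch_features, axis=1, keepdims=True)
    n = feats.shape[0]
    available = np.ones(n, dtype=bool)  # patches not yet assigned to an object
    masks = []

    for _ in range(n_masks):
        # Thresholded cosine-similarity affinity matrix.
        W = feats @ feats.T
        W = np.where(W > tau, 1.0, 1e-5)
        # Disconnect patches already claimed by earlier masks.
        W[~available, :] = 1e-5
        W[:, ~available] = 1e-5
        D = np.diag(W.sum(axis=1))

        # The second-smallest generalized eigenvector of (D - W) v = lambda D v
        # gives the relaxed Normalized Cut bipartition.
        _, vecs = eigh(D - W, D, subset_by_index=[1, 1])
        v = vecs[:, 0]

        # Simplified foreground choice: one side of the bipartition,
        # restricted to patches that are still available.
        fg = (v > v.mean()) & available
        if fg.sum() == 0:
            break
        masks.append(fg)
        available &= ~fg  # mask out this object and repeat

    return masks
```

In practice, these coarse patch-level masks are upsampled to the image resolution and refined before being used as pseudo ground truth for training the detector, which is how a single ImageNet image can contribute multiple training masks.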

Please let me know if you have further questions.

frank-xwang commented 1 year ago

Closing it now. Please feel free to reopen it if you have further questions.