Code release for "Cut and Learn for Unsupervised Object Detection and Instance Segmentation" and "VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation"
During training, multiple masks are generated per image. However, ImageNet images typically contain only one object each. How can this training approach be successful? Shouldn't we use the COCO dataset, which contains images with multiple objects?
MaskCut detects an average of 2-3 objects per image on ImageNet, since many images in the dataset in fact contain multiple objects.
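For intuition, here is a minimal sketch of how an iterative Normalized-Cuts procedure like MaskCut can yield several masks from a single image: after each cut, the discovered patches are removed from the affinity graph and the cut is solved again. The names (`maskcut_sketch`, `tau`) and the simple mean-threshold foreground heuristic are illustrative assumptions, not this repo's actual API.

```python
# A minimal sketch of the iterative Normalized-Cuts idea behind MaskCut,
# assuming `features` is an (N, D) array of per-patch ViT (e.g. DINO)
# features. Function name, `tau`, and the foreground heuristic are
# illustrative assumptions, not the repo's actual implementation.
import numpy as np
import scipy.linalg

def maskcut_sketch(features, num_masks=3, tau=0.15):
    """Return up to `num_masks` binary patch masks via repeated NCut."""
    n = features.shape[0]
    feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    keep = np.ones(n, dtype=bool)  # patches not yet assigned to an object
    masks = []
    for _ in range(num_masks):
        idx = np.flatnonzero(keep)
        if idx.size < 2:
            break
        # Binarized cosine-similarity affinity between the remaining patches.
        W = feats[idx] @ feats[idx].T
        W = np.where(W > tau, 1.0, 1e-5)
        D = np.diag(W.sum(axis=1))
        # The second-smallest generalized eigenvector of (D - W) x = lam D x
        # (the Fiedler vector) defines the normalized cut's bipartition.
        _, vecs = scipy.linalg.eigh(D - W, D)
        fiedler = vecs[:, 1]
        fg = fiedler > fiedler.mean()  # one side of the cut = candidate object
        if not fg.any() or fg.all():
            break
        mask = np.zeros(n, dtype=bool)
        mask[idx[fg]] = True
        masks.append(mask)
        keep &= ~mask  # mask out the found object and cut again
    return masks

# Example: 14x14 = 196 patches with 64-dim features.
masks = maskcut_sketch(np.random.randn(196, 64))
```

Masking out each discovered region before re-solving the cut is what lets a single image contribute multiple pseudo-masks, even when only one object dominates.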
Moreover, we chose ImageNet to maintain consistency with prior unsupervised representation learning work and to demonstrate that CutLER, trained solely on ImageNet without any supervision, can excel at challenging detection and instance segmentation tasks without training on dedicated detection datasets.
While MSCOCO contains more objects per image, it has fewer images than ImageNet and is less diverse. Consequently, training on MSCOCO alone results in worse zero-shot detection and segmentation performance. In this work, we instead use MSCOCO as the evaluation dataset for zero-shot unsupervised object detection.
We also conducted an ablation study that trains the model on the YFCC dataset, which yields performance comparable to training CutLER on ImageNet; please see Table 11 for more details.