Closed amundra15 closed 1 month ago
I expect the dataset should contain some images with multiple objects, as we haven't filtered those out in any way. Releasing a sample dataset is a complex process and therefore I would recommend not planning around that.
It is unclear if the model was trained on object-centric images (eg. ImageNet) or scene-level images. In Sec. 3, the authors mention retrieving the dataset by crawling the web. Does this mean that the dataset contains scenes composed of multiple objects?
Perhaps you could release a small sample dataset to give a general idea.