PatrykChrabaszcz / Imagenet32_Scripts

Scripts for Imagenet 32 dataset
MIT License

Validation set duplicate images #8

Closed syzymon closed 3 years ago

syzymon commented 3 years ago

Hi,

Are you aware that the validation set of 49999 images downloaded from http://image-net.org/small/valid_32x32.tar contains a lot of duplicate images? A few examples to reproduce:

02273.png and 42263.png
04990.png and 45295.png

Overall, there are only ~45047 unique images in the validation set: about 5k of them occur twice, and a few even three times. Is that intended, to give some examples more weight in the validation score, or is it a bug? I am also wondering whether this applies to the 64x64 version; I haven't tested that yet.
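
For anyone who wants to check an extracted archive themselves, here is a minimal sketch of one way to count duplicates. It is not necessarily the method used for the numbers above: it assumes the .png files sit directly in one directory and that duplicates are byte-identical files, so images with equal pixels but different PNG encoding would be missed. The function name `find_duplicate_files` is made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(directory, pattern="*.png"):
    """Group files in `directory` by the SHA-256 hash of their raw bytes.

    Returns (number of distinct file contents, list of duplicate groups),
    where each group is a list of file names sharing identical bytes.
    """
    groups = defaultdict(list)
    # Sort the paths so the output order is deterministic.
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    # Keep only the contents that appear in more than one file.
    duplicates = [names for names in groups.values() if len(names) > 1]
    return len(groups), duplicates
```

Calling `find_duplicate_files("valid_32x32")` (the directory name is an assumption about where the archive was extracted) returns the number of unique images and the groups of file names that share the same content.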

Thanks

PatrykChrabaszcz commented 3 years ago

The dataset generation process was automated. Do you know whether the original ImageNet also contains duplicates? If it does not, then something is indeed wrong.

PatrykChrabaszcz commented 3 years ago

Please be aware that the dataset generated by us is hosted here: http://image-net.org/download-images

Your link refers to another dataset. Maybe this is the source of the confusion.

syzymon commented 3 years ago

I have downloaded the images from http://image-net.org/download-images, and indeed the validation set there contains only 29 duplicates, which is much better. The dataset with a lot of duplicates (http://image-net.org/small/valid_32x32.tar) comes from this paper: https://arxiv.org/pdf/1601.06759.pdf. I don't know where they took the data from, but it seems that the raw .png data already has duplicates in it. Sorry for the confusion.

Thanks