fwang91 / IMDb-Face

A new large-scale noise-controlled face recognition dataset.

Repetitions in file names and class labels #15

Open swghosh opened 4 years ago

swghosh commented 4 years ago

The CSV file used to download the dataset contains some repeated URL values (possibly intentional, since a single picture may contain several faces), and the class labels assigned to a few celebrity names are repeated as well.

The following entries refer to two different celebrities, yet have the same class index.

Apart from that, a few entries in the dataset are pure repetitions: each one shares the same class index, filename, and URL as another entry (assuming that the format {class_index}_{filename.jpg} should mark a unique entry).
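A minimal sketch of how the shared class indices could be listed. It assumes the CSV has a `name` column holding the celebrity name; only the `index` and `image` columns are confirmed by the code in this issue. The helper names `names_per_index`, `conflicting_indices`, and `load_rows` are hypothetical.

```python
import csv
from collections import defaultdict


def names_per_index(rows, index_col='index', name_col='name'):
    """Map each class index to the set of celebrity names that use it.

    name_col is an assumption: this issue only confirms that the CSV
    has 'index' and 'image' columns.
    """
    names = defaultdict(set)
    for row in rows:
        names[row[index_col]].add(row[name_col])
    return names


def conflicting_indices(rows, index_col='index', name_col='name'):
    """Return only the class indices that are shared by more than one name."""
    grouped = names_per_index(rows, index_col, name_col)
    return {
        index: sorted(found)
        for index, found in grouped.items()
        if len(found) > 1
    }


def load_rows(path='IMDb-Face.csv'):
    """Read the dataset CSV into a list of dicts, one per row."""
    with open(path, newline='') as csv_file:
        return list(csv.DictReader(csv_file))
```

Calling `conflicting_indices(load_rows())` should return a dict from each affected class index to the names that share it. An empty dict would mean no index is shared under this assumed column layout.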

Hope this helps! Alternatively, please let me know if I was mistaken and these were intentional.

Sample code to reproduce the problem.

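import csv

# Build one '{index}_{image}' key per row of the dataset CSV.
with open('IMDb-Face.csv', 'r', newline='') as file_a:
    spreadsheet = csv.DictReader(file_a)
    entries = ['%s_%s' % (entry['index'], entry['image']) for entry in spreadsheet]

print(len(entries), 'entries were found.')
# If the set is smaller than the list, some keys occur more than once.
unique_entries = set(entries)
print(len(unique_entries), 'unique entries were found.')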
+ 1662888 entries were found.
- 1632927 unique entries were found.
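The counts above show how many keys repeat but not which ones. A possible follow-up sketch, using only the `index` and `image` columns already used in this issue; the helper names `repeated_keys` and `surplus_rows` are hypothetical.

```python
from collections import Counter


def repeated_keys(rows):
    """Count '{index}_{image}' keys and keep those seen more than once."""
    counts = Counter('%s_%s' % (row['index'], row['image']) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def surplus_rows(repeated):
    """Number of rows that keeping one copy per key would drop."""
    return sum(n - 1 for n in repeated.values())
```

Here `rows` can be the `csv.DictReader` over `IMDb-Face.csv`. If the counts reported above hold, `surplus_rows(repeated_keys(rows))` should equal 1662888 - 1632927 = 29961.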

Apich238 commented 9 months ago

I downloaded the dataset, and it looks like the "Kanchan" class is junk or an error, while "Ilias_Kanchan" is the real class.
