Repetitions in file names and class labels

The CSV file used to download the dataset contains some repeated URL values (possibly intentional, since a single picture may contain several faces), as well as class labels that are assigned to more than one celebrity name.

The following two entries refer to two different celebrities, yet share the same class index:

Kanchan - nm0437156
Ilias_Kanchan - nm0437156
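
To look for other cases like this one, a minimal sketch is to group the names by class index and keep the indices that carry more than one name. The helper names_per_index is hypothetical, and the 'name' column header is an assumption about the CSV (only 'index' and 'image' are used in the sample code in this report), so adjust it if the header differs.

```python
from collections import defaultdict


def names_per_index(rows, index_key='index', name_key='name'):
    """Return {class_index: [names]} for indices shared by several names.

    `rows` is any iterable of dict-like records, e.g. a csv.DictReader.
    The default `name_key` is an assumed column header.
    """
    names = defaultdict(set)
    for row in rows:
        names[row[index_key]].add(row[name_key])
    # Keep only the class indices that appear with more than one name
    return {idx: sorted(found) for idx, found in names.items() if len(found) > 1}


# Possible usage against the real file:
# import csv
# with open('IMDb-Face.csv', newline='') as f:
#     print(names_per_index(csv.DictReader(f)))
```

An index that maps to two spellings of the same person would also be reported, so the output still needs a manual look.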

Apart from that, a few entries in the dataset are pure repetitions: each duplicate has the same class index, filename and URL as another entry. (This assumes that the format {class_index}_{filename.jpg} should identify a unique entry.)

Hope this helps! Alternatively, please let me know if I was mistaken and these repetitions are intentional.

Sample code to reproduce the problem:

import csv

# Build one '{class_index}_{filename}' key per CSV row
with open('IMDb-Face.csv', 'r', newline='') as file_a:
    spreadsheet = csv.DictReader(file_a)
    entries = ['%s_%s' % (entry['index'], entry['image']) for entry in spreadsheet]
print(len(entries), 'entries were found.')

# Collapsing the keys into a set drops the repeated ones
unique_entries = set(entries)
print(len(unique_entries), 'unique entries were found.')

+ 1662888 entries were found.
- 1632927 unique entries were found.
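
The two counts differ by 29961 rows (1662888 - 1632927). To see which keys are repeated rather than only how many, a rough sketch is to count each key with collections.Counter. The helper duplicated_keys is a hypothetical name; it builds the same '{class_index}_{filename}' key as the sample code and relies on the same 'index' and 'image' columns.

```python
from collections import Counter


def duplicated_keys(rows):
    """Return {key: count} for every '{index}_{image}' key seen more than once.

    `rows` is any iterable of dict-like records, e.g. a csv.DictReader.
    """
    counts = Counter('%s_%s' % (row['index'], row['image']) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Possible usage against the real file:
# import csv
# with open('IMDb-Face.csv', newline='') as f:
#     dupes = duplicated_keys(csv.DictReader(f))
# print(sum(n - 1 for n in dupes.values()), 'surplus rows')
```

Summing n - 1 over the result should give the same difference as the two printed counts.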

fwang91 / IMDb-Face
