davidsandberg closed 7 years ago
There is another dataset, https://www.microsoft.com/en-us/research/project/ms-celeb-1m-challenge-recognizing-one-million-celebrities-real-world/, but it's very dirty. I trained a separate model to remove the garbage. Still, it's around 10 million images and I don't have the resources to process all of them.
Hi @hudvin!
How did you plan to remove the garbage?
I ran some tests on the CASIA dataset, selecting a subset of the images based on each image's distance to its class center (the implementation can be found on the branch dataset_filtering). It was only intended as an experiment to validate the method, but it actually improved the accuracy (when trained on CASIA only) from 0.984 to 0.988, which was a bit surprising. The plan is to apply the same principle to the MsCeleb dataset, which contains a lot more label noise, and see what happens.
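To make the idea concrete, here is a minimal sketch of filtering by distance to the class center. It is not the code from the dataset_filtering branch: the function name `filter_by_center_distance`, the `keep_fraction` parameter, and the rule of keeping a fixed fraction of each class are all assumptions for illustration, and the embeddings are assumed to come from an already trained model.

```python
import math
from collections import defaultdict

def filter_by_center_distance(embeddings, labels, keep_fraction=0.9):
    """Return the sorted indices of the samples to keep.

    Hypothetical helper: for every class, compute the mean embedding
    (the class center) and keep the `keep_fraction` of samples that
    lie closest to it. The remaining samples are treated as label noise.
    """
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    keep = []
    for idx in by_class.values():
        dim = len(embeddings[idx[0]])
        # Class center: per-dimension mean of the class embeddings.
        center = [sum(embeddings[i][d] for i in idx) / len(idx)
                  for d in range(dim)]
        # Euclidean distance of every sample to its class center.
        dist = {i: math.dist(embeddings[i], center) for i in idx}
        # Assumed rule: keep a fixed fraction, but at least one sample.
        n_keep = max(1, math.ceil(keep_fraction * len(idx)))
        keep.extend(sorted(idx, key=dist.get)[:n_keep])
    return sorted(keep)
```

A fixed distance threshold would work just as well as a per-class fraction; which one suits a noisy dataset better is something to determine experimentally.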
I suppose there must be at least two stages:
Here are some results from the training. I still need to test some more aggressive dataset filtering settings, but so far the best LFW accuracy is around 0.994.
The decode program does work, but the comment and the assignment of img_name/img_string are inconsistent:
# Column1: Freebase MID
# Column2: Query/Name
# Column3: ImageSearchRank
# Column4: ImageURL
# Column5: PageURL
# Column6: ImageData_Base64Encoded
.
.
.
img_name = fields[1] + '-' + fields[4] + '.' + args.output_format
img_string = fields[6]
The comment seems to describe the Full ImageThumbnails version, while the assignment matches the cropped or aligned version. The inconsistency could be fixed by changing the comment to:
# Column1: Freebase MID
# Column2: ImageSearchRank
# Column3: ImageURL
# Column4: PageURL
# Column5: FaceID
# Column6: FaceRectangle_Base64Encoded (four floats, relative coordinates of UpperLeft and BottomRight corner)
# Column7: FaceData_Base64Encoded
The dataset is available here. A Python program to decode the face thumbnails can be found here.
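As an illustration of how the seven-column layout in the corrected comment maps onto zero-based field indices, here is a minimal sketch of decoding one record. It is not the decode program referred to above: the function name `decode_face_record` and the tab separator are assumptions, and only the `img_name`/`img_string` assignments are taken from the thread.

```python
import base64

def decode_face_record(line, output_format="png"):
    """Decode one line of the cropped/aligned file (hypothetical helper).

    Assumed layout, tab-separated, zero-based indices:
      fields[0] Freebase MID          fields[4] FaceID
      fields[1] ImageSearchRank       fields[5] FaceRectangle (base64)
      fields[2] ImageURL              fields[6] FaceData (base64)
      fields[3] PageURL
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 columns, got %d" % len(fields))
    class_dir = fields[0]
    # Same assignments as in the thread: search rank and face ID form the name.
    img_name = fields[1] + '-' + fields[4] + '.' + output_format
    img_string = fields[6]
    # The image payload is base64 text; decode it to raw bytes.
    img_bytes = base64.b64decode(img_string)
    return class_dir, img_name, img_bytes
```

The returned bytes are the encoded image as stored in the file; writing them to disk or converting them to another format is left to the caller.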