davidsandberg / facenet

Face recognition using Tensorflow

MIT License

MS-Celeb-1M dataset filtering (?) #612

Open rrbarioni opened 6 years ago

rrbarioni commented 6 years ago

Hello @davidsandberg,

"The best performing model has been trained on a subset of the MS-Celeb-1M dataset. This dataset is significantly larger but also contains significantly more label noise, and therefore it is crucial to apply dataset filtering on this dataset."

I was wondering what kind of filtering was performed. Also, have you trained a model the same way but without any filtering of the MS-Celeb-1M dataset (and, if so, what were the results)?

Thanks in advance.

Shahnawazgrewal commented 6 years ago

I devised a pipeline to clean the MS-Celeb-1M dataset; I can share the process with you. @rrbarioni

rrbarioni commented 6 years ago

@Shahnawazgrewal Oh, that would be great! Thank you.

Shahnawazgrewal commented 6 years ago

I trained an Inception-ResNet-v1 model on VGGFace2 using center loss. Once I had the model, I used it to compute embeddings and fed them into a clustering algorithm (DBSCAN). It gives clusters for each identity, and I selected the cluster with the highest number of images.
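The clustering step described in this comment can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual code: the function names `dbscan` and `keep_largest_cluster`, the `eps` and `min_pts` values, and the 2-D toy points are all assumptions made for the example (real face embeddings would be higher-dimensional vectors). A small pure-Python DBSCAN is used so the snippet has no dependencies; in practice a library implementation such as scikit-learn's `sklearn.cluster.DBSCAN` would be the usual choice.

```python
import math
from collections import Counter

NOISE = -1  # label for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point, NOISE for outliers."""
    n = len(points)
    # Each neighborhood includes the point itself, as in the usual definition.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
        cluster += 1
    return labels


def keep_largest_cluster(embeddings, eps, min_pts):
    """Indices of the images in the largest cluster of one identity."""
    labels = dbscan(embeddings, eps, min_pts)
    counts = Counter(label for label in labels if label != NOISE)
    if not counts:
        return []  # everything was noise
    largest = counts.most_common(1)[0][0]
    return [i for i, label in enumerate(labels) if label == largest]


# Toy "identity folder": four consistent images, two of someone else, one outlier.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5), (5.1, 5), (9, 9)]
print(keep_largest_cluster(points, eps=0.3, min_pts=2))  # [0, 1, 2, 3]
```

The indices returned are the images that would be kept for that identity; everything else (the smaller cluster and the noise point) would be treated as label noise.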

rrbarioni commented 6 years ago

@Shahnawazgrewal excellent! For the DBSCAN algorithm, I imagine that you set the neighborhood radius to the best threshold (distance between faces) obtained from the first model you trained. To decide whether a node is a core point, did you set minPts arbitrarily, or did it depend on other information (for example, the number of images in the identity)? (The DBSCAN article on Wikipedia has an example that uses minPts=4.)

Thanks again in advance,

Shahnawazgrewal commented 6 years ago

If you use a high value, you will get clusters whose images are only seemingly similar; however, if you use a low value, you will get clusters whose images are similar. I have done numerous experiments to support this hypothesis. Are you using it for a publication? I am including these results in a paper, along with a cleaning pipeline that can be part of the dataset creation process. @rrbarioni

rrbarioni commented 6 years ago

@Shahnawazgrewal Thank you! Good luck with your work.

Shahnawazgrewal commented 6 years ago

What are your plans, @rrbarioni?

rrbarioni commented 6 years ago

@Shahnawazgrewal Oh sorry, didn't see you asked a question before. Actually I'm just trying to obtain better results on LFW.

Best regards,

Shahnawazgrewal commented 6 years ago

Do let me know if you need any suggestions. I modified cluster.py to accommodate the cleaning process. @rrbarioni

rrbarioni commented 6 years ago

@Shahnawazgrewal Nice! Thanks.

Shahnawazgrewal commented 6 years ago

Which platform do you have for training? @rrbarioni

SiddhardhaSaran commented 6 years ago

Another filtering method is to use the dlib face recognition model and apply Chinese whispers clustering (with a distance threshold of 0.5) to every folder, keeping only the largest image cluster and deleting all the others. You need to install dlib for that.
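The per-folder "keep the largest cluster, delete the rest" step could look something like the sketch below. The function name `filter_identity_folder`, the `cluster_fn` callback, the `dry_run` flag and the list of image extensions are all hypothetical and not taken from this thread; the clustering itself is passed in as a function so the snippet runs without dlib.

```python
import os
from collections import Counter

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")  # assumed set of image types


def filter_identity_folder(folder, cluster_fn, dry_run=True):
    """Keep only the largest cluster of images in one identity folder.

    cluster_fn takes the list of image paths and returns one integer
    cluster label per path. With dry_run=True nothing is deleted.
    """
    paths = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
    if not paths:
        return [], []
    labels = cluster_fn(paths)
    largest = Counter(labels).most_common(1)[0][0]
    keep = [p for p, label in zip(paths, labels) if label == largest]
    drop = [p for p, label in zip(paths, labels) if label != largest]
    if not dry_run:
        for path in drop:
            os.remove(path)  # irreversible: run with dry_run=True first
    return keep, drop
```

With dlib, `cluster_fn` could compute one face descriptor per image and pass the list to `dlib.chinese_whispers_clustering(descriptors, 0.5)`, which returns one label per descriptor. That wiring is left out here because it needs dlib's model files. Running once with `dry_run=True` and inspecting the `drop` list before deleting anything is probably a sensible precaution.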

Shahnawazgrewal commented 6 years ago

I did the same; however, I used a model trained on the VGGFace2 dataset. I then selected the largest cluster and removed the other clusters. This improved the overall accuracy. @SiddhardhaSaran

hungnv21292 commented 5 years ago

Hi @Shahnawazgrewal,

Could you please share your pipeline for cleaning the MS-Celeb-1M dataset? I am using Asian-Celeb to fine-tune FaceNet.

Thanks