Open rrbarioni opened 6 years ago
I devised a pipeline to clean the MS-Celeb-1M dataset, and I can share the process with you. @rrbarioni
@Shahnawazgrewal Oh, that would be great! Thank you.
I trained an Inception-ResNet-v1 model on VGGFace2 using center loss. Once I had the model, I used it to compute embeddings and fed them into a clustering algorithm (DBSCAN). This gives clusters for each identity, and I kept the cluster with the highest number of images.
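A minimal sketch of the cluster-and-keep-largest step described above, not the actual pipeline. It assumes the embeddings for one identity folder are already computed; the tiny pure-Python `dbscan`, the helper `keep_largest_cluster`, the 2-D toy points, and the values `eps=0.5` and `min_pts=3` are all illustrative assumptions (real face embeddings are high-dimensional, and a library implementation would normally be used).

```python
import math
from collections import Counter


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # eps-neighborhood of every point (a point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster_id += 1  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also core: keep expanding
    return labels


def keep_largest_cluster(embeddings, eps, min_pts):
    """Indices of the images in the biggest cluster (hypothetical helper)."""
    labels = dbscan(embeddings, eps, min_pts)
    counts = Counter(label for label in labels if label != -1)
    if not counts:
        return []  # everything was noise
    largest = counts.most_common(1)[0][0]
    return [i for i, label in enumerate(labels) if label == largest]


# Toy "identity folder": 5 images of the right person, 3 of someone else,
# and 1 unrelated outlier (made-up 2-D points standing in for embeddings).
toy_embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
kept = keep_largest_cluster(toy_embeddings, eps=0.5, min_pts=3)
print(kept)  # [0, 1, 2, 3, 4]
```

Images whose index is not in `kept` would be the ones dropped from the folder.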
@Shahnawazgrewal Excellent! For the DBSCAN algorithm, I imagine that you set the neighborhood radius to the best threshold (distance between faces) obtained from your first model. To decide whether a node is a core point, did you set minPts arbitrarily, or did it depend on other information (for example, the number of images in the identity)? (The DBSCAN article on Wikipedia has an example that uses minPts=4.)
Thanks again in advance,
If you use a high value, you will get images in a cluster that are only seemingly similar; if you use a low value, you will get images that are similar inside a cluster. I have done numerous experiments to support this hypothesis. Are you using it for a publication? I am including these results in a paper, along with the cleaning pipeline, which can be part of the dataset creation process. @rrbarioni
@Shahnawazgrewal Thank you! Good luck with your work.
What are your plans? @rrbarioni
@Shahnawazgrewal Oh sorry, didn't see you asked a question before. Actually I'm just trying to obtain better results on LFW.
Best regards,
Do let me know if you need any suggestions. I modified cluster.py to accommodate the cleaning process. @rrbarioni
@Shahnawazgrewal Nice! Thanks.
Which platform do you use for training? @rrbarioni
Another filtering method is to use the dlib face recognition model and apply Chinese whispers clustering (with a distance threshold of 0.5) to every folder, keeping only the largest image cluster and deleting all others. You need to install dlib for that.
I did the same; however, I used a model trained on the VGGFace2 dataset, then selected the largest cluster and removed the other clusters. This improved the overall accuracy. @SiddhardhaSaran
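For readers who want to see the idea behind the Chinese whispers filtering mentioned above, here is a simplified pure-Python sketch. It is not dlib's implementation: the function names, the 2-D toy points, and the deterministic tie-break (smallest label wins, where the original algorithm picks randomly) are assumptions made for illustration. Only the 0.5 threshold comes from the comment above.

```python
import math
import random
from collections import Counter


def chinese_whispers(points, threshold, iterations=20, seed=0):
    """Return one cluster label per point (simplified, unweighted variant)."""
    n = len(points)
    # Link two faces when their embedding distance is below the threshold.
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) < threshold]
        for i in range(n)
    ]
    labels = list(range(n))  # every face starts in its own cluster
    rng = random.Random(seed)
    order = list(range(n))
    for _ in range(iterations):
        rng.shuffle(order)  # visit nodes in a fresh random order each pass
        for i in order:
            if not neighbors[i]:
                continue  # an isolated face keeps its own label
            votes = Counter(labels[j] for j in neighbors[i])
            best = max(votes.values())
            # Assumption: ties go to the smallest label instead of a random one.
            labels[i] = min(lab for lab, cnt in votes.items() if cnt == best)
    return labels


def largest_cluster(labels):
    """Indices of the points that share the most common label."""
    if not labels:
        return []
    winner = Counter(labels).most_common(1)[0][0]
    return [i for i, label in enumerate(labels) if label == winner]


# Toy folder: 5 faces of the intended identity, 3 of another person, 1 outlier.
toy_faces = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
face_labels = chinese_whispers(toy_faces, threshold=0.5)
keep = largest_cluster(face_labels)
print(keep)  # [0, 1, 2, 3, 4]
```

Unlike DBSCAN, this needs no minimum cluster size, only the distance threshold, which may be why it suits a per-folder cleanup.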
Hi @Shahnawazgrewal ,
Could you please share your pipeline for cleaning the MS-Celeb-1M dataset? I am using Asian-Celeb to fine-tune FaceNet.
Thanks
Hello @davidsandberg ,
"The best performing model has been trained on a subset of the MS-Celeb-1M dataset. This dataset is significantly larger but also contains significantly more label noise, and therefore it is crucial to apply dataset filtering on this dataset."
I was wondering what kind of filtering approaches were used. Have you also trained a model the same way but without any filtering on the MS-Celeb-1M dataset (and, if so, what were the results)?
Thanks in advance.