davidsandberg / facenet

Face recognition using Tensorflow
MIT License

Poor score when using pre-trained model on lfw #594

Closed: kmonachopoulos closed this issue 6 years ago

kmonachopoulos commented 6 years ago

I am trying to create a script that evaluates facenet on the LFW dataset. As a process, I am reading pairs of images (using the LFW annotation list), detecting and cropping the face, aligning it, and passing it through a pre-trained facenet model (a .pb file, using TensorFlow) to extract the features. The feature vector size is (1, 128) and the input image is (160, 160, 3). For some reason, if I change the image size, the embedding size changes as well ...
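For reference, my extraction step is essentially the following (a simplified sketch; the tensor names are the ones compare.py uses for the frozen models in this repo, and the images are assumed to be already aligned and prewhitened with facenet.prewhiten):

```python
import numpy as np
import tensorflow as tf

def extract_embeddings(pb_path, images):
    # images: float array (n, 160, 160, 3), aligned and prewhitened
    with tf.Graph().as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(pb_path, 'rb') as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name='')
        with tf.Session() as sess:
            emb = sess.run('embeddings:0',
                           feed_dict={'input:0': images, 'phase_train:0': False})
    return emb  # shape (n, 128) for the 128-d models
```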

Anyway, to evaluate on the verification task, I am using a Siamese architecture. That is, I am passing a pair of images (same or different person) through two identical models ([2 x facenet]; this is equivalent to passing a batch of images of size (2, 160, 160, 3) through a single network) and calculating the Euclidean distance of the embeddings. Finally, I am training a linear SVM classifier on the pair labels to output 0 when the embedding distance is small and 1 otherwise. This way I am trying to learn a threshold to be used while testing (1000 pairs).
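In code, the distance/SVM step boils down to something like this (a sketch with random placeholder data standing in for the real embeddings and labels):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
emb1 = rng.randn(1100, 128)        # placeholder: first image of each training pair
emb2 = rng.randn(1100, 128)        # placeholder: second image of each pair
labels = rng.randint(0, 2, 1100)   # 0 = same person, 1 = different person

# one scalar feature per pair: squared Euclidean distance of the embeddings
dist = np.sum(np.square(emb1 - emb2), axis=1)

# a linear SVM on a single feature effectively just learns a distance threshold
clf = LinearSVC()
clf.fit(dist.reshape(-1, 1), labels)
print('train accuracy:', clf.score(dist.reshape(-1, 1), labels))
```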

Using this architecture and testing both of the .pb models provided here, I am getting a score of 60% at most. On the other hand, using the same architecture on other models (e.g. vgg-face), where the features are 4096-d [fc7:0] (not embeddings), I am getting 90%. I definitely cannot replicate the scores that I see online (99.x%), but using the embeddings the score is very low. Is there something wrong with the pipeline in general? How can I evaluate the embeddings on face verification?

spantazi commented 6 years ago

@kmonachopoulos I think that the validation uses KNN to find the best embedding distance threshold. Take a look at validate_on_lfw.py, lfw.py and facenet.py.

kmonachopoulos commented 6 years ago

@spantazi I thought KNN is used to match the closest person in the DB with respect to Euclidean distance. But that is used for identification, not verification. I don't want to identify a person with a one-to-all comparison, but to determine whether two instances belong to the same class (same person); I don't care who the person is, only whether the person is who he/she claims to be (verification). The paper says that at the end of the pipeline a classifier is used to determine whether two faces belong to the same class, outputting 1 if different and 0 otherwise.

The problem is that the embeddings output for different users hold near-identical values, so the distance is still small, and the classifier, being linear, cannot distinguish between the classes.

The model should work straight away, but it doesn't, and I cannot figure out why.

spantazi commented 6 years ago

@kmonachopoulos Yes indeed, KNN can be used for identification, not verification. But if you can find the maximum distance between any pair of face embeddings belonging to the same person that produces the best validation score (exactly as validate_on_lfw does), then you can use that as the threshold for your verifier. In theory, if the pre-trained facenet model is generic enough and produces good feature embeddings, you can use that threshold to verify new pairs of faces (e.g. from your own dataset) without an SVM classifier. See #413 and #375.
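A simplified sketch of the threshold search that lfw.py performs (it sweeps a fixed grid of distances and keeps the most accurate one):

```python
import numpy as np

def best_threshold(dist, actual_same):
    # dist: (n,) embedding distances; actual_same: (n,) bool, True if same person
    thresholds = np.arange(0, 4, 0.01)   # the grid lfw.py sweeps
    accuracies = [np.mean((dist < t) == actual_same) for t in thresholds]
    best = int(np.argmax(accuracies))
    return thresholds[best], accuracies[best]
```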

kmonachopoulos commented 6 years ago

@spantazi I see your point, but in theory an SVM classifier should produce better results than just choosing the maximum intra-class distance as a threshold. The maximum distance could come from an outlier, and I don't think setting the threshold to that value would be wise. On the other hand, estimating a threshold in the Euclidean feature space from the ROC curve, based on FAR and FRR, could produce better results.
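For example, something along these lines (a sketch with sklearn, reusing dist and labels from my snippet above; label 1 = different person, so the distance itself serves as the score):

```python
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thr = roc_curve(labels, dist)
frr = fpr        # genuine (same) pairs rejected as different
far = 1.0 - tpr  # impostor (different) pairs accepted as same
eer_idx = np.argmin(np.abs(far - frr))   # operating point where FAR ~ FRR
threshold = thr[eer_idx]
```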

My issue here is that the score with the SVM is extremely poor (60%) and it should be better. It's not that I want to go from 98 to 99 percent, which choosing a specific classifier (e.g. SVM vs. logistic regression vs. Mahalanobis distance, etc.) might achieve.

spantazi commented 6 years ago

@kmonachopoulos Perhaps the problem is that you feed only one feature (the Euclidean distance) to your SVM. Maybe you need to input 2x128-d (the embeddings of the two faces) in order to output the verification result (1 or 0). That input dimensionality will give the SVM more insight into your data. You may also need more than 1000 pairs for training. As for the vgg-face pre-trained model, which one have you used? If it is a triplet-loss pre-trained model, keep in mind that such models maximize the inter-class distance and minimize the intra-class distance, which is the reason for better scores when feeding the Euclidean distance into an SVM classifier. Finally, which align/crop method do you use? You should first pass your photos through the MTCNN alignment process to produce the 160x160x3 input image for the facenet model (see load_and_align_data in compare.py).
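For example (a sketch reusing emb1, emb2 and labels from your snippet above):

```python
import numpy as np
from sklearn.svm import SVC

# feed both embeddings, not just their distance, so the SVM sees the full 256-d input
X = np.concatenate([emb1, emb2], axis=1)
clf = SVC(kernel='rbf')  # non-linear kernel; pairs are unlikely to be linearly
                         # separable in the raw concatenated space
clf.fit(X, labels)
```

Note that plain concatenation is order-sensitive; a symmetric alternative is to feed the element-wise |emb1 - emb2| instead, so the result does not depend on which face comes first.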

kmonachopoulos commented 6 years ago

@spantazi I have tried that and the performance is still poor. Besides, the paper says the classification is done using the Euclidean distance, not the raw embeddings. This is what it says about the evaluation: "given a pair of two face images a squared L2 distance threshold D(xi, xj) is used to determine the classification of same and different."

In my implementation there are 1100 image pairs for training and 1000 for testing. About the model selection: the vgg-face model is trained for identification using a softmax loss and turned into a verifier by extracting the features from the previous layer. Supposedly, the facenet model should provide better results, since it uses a triplet loss and the other does not. Face detection and landmark detection are implemented with dlib, and the affine transform with OpenCV.

spantazi commented 6 years ago

@kmonachopoulos See the wiki guide on how to align the LFW dataset before validation. It is important to use the MTCNN face alignment/cropping with the default settings (160x160 images with a 32 px margin), otherwise the accuracy is bad. I believe the use of dlib for face cropping is responsible for your bad scores.
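A condensed sketch of load_and_align_data from compare.py (the repo used scipy.misc at the time; for a whole dataset the wiki runs align/align_dataset_mtcnn.py with --image_size 160 --margin 32):

```python
import numpy as np
import tensorflow as tf
from scipy import misc
import align.detect_face  # from this repo

def mtcnn_align(image_path, image_size=160, margin=32):
    minsize, threshold, factor = 20, [0.6, 0.7, 0.7], 0.709
    with tf.Graph().as_default():
        sess = tf.Session()
        pnet, rnet, onet = align.detect_face.create_mtcnn(sess, None)
    img = misc.imread(image_path)
    boxes, _ = align.detect_face.detect_face(
        img, minsize, pnet, rnet, onet, threshold, factor)
    det = np.squeeze(boxes[0, 0:4])          # take the first detected face
    bb = np.zeros(4, dtype=np.int32)         # expand the box by the margin
    bb[0] = np.maximum(det[0] - margin / 2, 0)
    bb[1] = np.maximum(det[1] - margin / 2, 0)
    bb[2] = np.minimum(det[2] + margin / 2, img.shape[1])
    bb[3] = np.minimum(det[3] + margin / 2, img.shape[0])
    cropped = img[bb[1]:bb[3], bb[0]:bb[2], :]
    return misc.imresize(cropped, (image_size, image_size), interp='bilinear')
```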

maestrojeong commented 6 years ago

@davidsandberg @spantazi Although I used MTCNN face alignment/cropping with the default settings (160x160 image size and a 32 px margin) as in the wiki, the accuracy was 98%, not the 99% stated in the wiki. VAL@FAR=0.1% was 90%, not 97%. The area under the ROC curve was 0.98, not 1.000. Has anyone gotten the reported scores when following "Validate on LFW" in the wiki?

spantazi commented 6 years ago

@maestrojeong Which pre-trained model did you use? The model trained on the CASIA-WebFace dataset has an accuracy of 98%, not 99%.

maestrojeong commented 6 years ago

@spantazi The pre-trained model I used is this one. Thank you

spantazi commented 6 years ago

@maestrojeong Please read the other issues concerning validation on the LFW dataset in case you missed something in the validation process. You should get 99% with the pre-trained model you used.