davidsandberg / facenet

Face recognition using Tensorflow
MIT License

An issue with the accuracy of facenet #452

Open github4f opened 7 years ago

github4f commented 7 years ago

Hi- I use a pipeline based on MTCNN for face detection followed by FaceNet for face recognition (I tested both the 20170511-185253.pb and 20170512-110547.pb models). I run this process in a while loop for 2 minutes, capturing frames/images with OpenCV's video capture function. Finally, I compare my own images/faces over this period using cosine and Euclidean distances, but the results are not promising at all: the error can reach 50% between frames. Any idea why that is and how I can improve it?

1) Do I need to do normalization, or rgb2gray before giving my frames to mtcnn?

2) Does MTCNN already perform alignment?

3) Based on align_dlib.py, I use the following line: scaled = misc.imresize(img, prealigned_scale, interp='bilinear')

rather than using: align.align(image_size, img, landmarkIndices=landmarkIndices, skipMulti=False, scale=scale)

4) What about alignment (3D alignment such as Facebook's DeepFace)?

5) I use the following code in the while loop

    input_image_size = 160

    ret, frame = video_capture.read()
    bounding_boxes, _ = detect_face.detect_face(frame, minsize, pnet, rnet, onet, threshold, factor)
    nrof_faces = bounding_boxes.shape[0]
    if nrof_faces > 0:
        det = bounding_boxes[:, 0:4]
        img_size = np.asarray(frame.shape)[0:2]
        cropped = []
        scaled = []
        scaled_reshape = []
        bb = np.zeros((nrof_faces, 4), dtype=np.int32)

        for i in range(nrof_faces):
            emb_array = np.zeros((1, embedding_size))

            bb[i][0] = det[i][0]
            bb[i][1] = det[i][1]
            bb[i][2] = det[i][2]
            bb[i][3] = det[i][3]

            # draw the detection box, then crop the face region from the frame
            cv2.rectangle(frame, (bb[i][0], bb[i][1]), (bb[i][2], bb[i][3]), (0, 255, 0), 2)
            cropped.append(frame[bb[i][1]:bb[i][3], bb[i][0]:bb[i][2], :])
            cropped[i] = facenet.flip(cropped[i], False)

            # resize to the model input size and prewhiten before embedding
            scaled.append(misc.imresize(cropped[i], (image_size, image_size), interp='bilinear'))
            scaled[i] = cv2.resize(scaled[i], (input_image_size, input_image_size), interpolation=cv2.INTER_CUBIC)
            scaled[i] = facenet.prewhiten(scaled[i])

            scaled_reshape.append(scaled[i].reshape(-1, input_image_size, input_image_size, 3))
            feed_dict = {images_placeholder: scaled_reshape[i], phase_train_placeholder: False}
            emb_array[0, :] = sess.run(embeddings, feed_dict=feed_dict)

            # then I compare (cosine, etc.) with embeddings of my own face pictures
            # (captured with the same camera, same background, etc.):
            dist_cosine = distance.cosine(saved_embedding, emb_array[0, :])
            dist_euclidean = distance.euclidean(saved_embedding, emb_array[0, :])

fgervais commented 7 years ago

I'm not sure if it would help or not, but you can take a look at my pull request https://github.com/davidsandberg/facenet/pull/450. It seems to be pretty much what you are doing, except for the distance measurement at the end.

In the end it's mostly about the classifier, but at least for me, with an SVM trained on my face among a hundred or so other faces from LFW, it can identify my face pretty much all the time at about 20 frames a second.

Sometimes the confidence level is not that high, but it's always about 50% higher than the next best guess.
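For reference, a minimal sketch of that kind of SVM-on-embeddings classifier (this is not the actual code from PR #450; the names embeddings, labels, and new_embedding are assumed to already exist):

    # Hedged sketch only: `embeddings` is an (N, 128) array of FaceNet embeddings and
    # `labels` is a list of N person names (your face plus ~100 LFW identities), both assumed.
    import numpy as np
    from sklearn.svm import SVC

    clf = SVC(kernel='linear', probability=True)
    clf.fit(embeddings, labels)

    # At run time: embed the detected face, then ask the SVM for the best identity.
    probs = clf.predict_proba(new_embedding.reshape(1, -1))[0]
    best = np.argmax(probs)
    print(clf.classes_[best], probs[best])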

github4f commented 7 years ago

@fgervais
Thanks for the note.

I was wondering how many images (of your own face) were in the SVM's training set?

I have two concerns with the classification: 1) when a new person is added to the database, the whole SVM has to be retrained! 2) In my application, it would be very hard to get more than a couple of pictures of the same person. How many pictures do I need from one person?

fgervais commented 7 years ago

I had about 20 images for each face.

The classifier is something I need to work on a bit more. I have a feeling that 20 images is too few, but most faces in the LFW database have about 20 images. I didn't want to use too many pictures of my own face and have the classifier always guess the face as being me.

As for retraining the whole classifier, well retraining is not that bad depending on the hardware I guess. If not then maybe this?

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
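As a rough sketch of that idea (the arrays embeddings, labels, new_embeddings, and new_labels are assumed; note that partial_fit needs the full set of classes on its first call, so a brand-new identity still forces a retrain):

    # Sketch only: incremental updates with SGDClassifier.partial_fit on embeddings.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss='log')            # logistic loss, so predict_proba works
    all_people = np.unique(labels)             # every identity you expect to see

    clf.partial_fit(embeddings, labels, classes=all_people)   # initial fit
    # ... later, when new images of an already-known person arrive:
    clf.partial_fit(new_embeddings, new_labels)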

vudung45 commented 7 years ago

Check out my repo that I composed a bigger project that I was working on this summer into a simple working package, https://github.com/vudung45/FaceRec

The result is somewhat decent IMO. You can easily write a simple script to input new data from pre-existing images, based on the mediocre one I wrote in main.py; I already provided a simple piece of code that lets the user add a new subject to the dataset using images from the webcam.

Fork or star it if you are interested.

github4f commented 7 years ago

Hi- I did an experiment using about 120 pictures of myself. You can find the histogram in the following figure. According to the results, I normally see more than 75% similarity between my pictures, but sometimes it drops below 70%!

[figure: myface — histogram of similarities among my own pictures]

In the following figure, I compare my own pictures (120 pictures) against 120 pictures of a friend of mine:

[figure: i_vs_my friend — histogram of similarities between my pictures and my friend's]

It is very interesting that sometimes the similarity between me and my friend is above 60%. Based on the figures above, I decided to use 75% as the threshold in my code.

@vudung45 : I am wondering whether your code is consistent with my results, or whether you reach better similarity among your pictures (for just one person)? I noticed that you use a different approach for alignment.

I also did some investigation into why there is so much similarity between me and my friend. I noticed that most of the time the problem is rooted in badly captured pictures; for example, there are some pictures in which both eyes are not visible, something like below:

[example image: figure drawing for dummies -mantesh-80-1 — a face with the eyes not visible]

I noticed that OpenFace (http://cmusatyalab.github.io/openface/) uses an OpenCV eye cascade to reject such false pictures, but MTCNN cannot do that! I suggest that we also include an eye cascade to remove those pictures. Please let me know your idea (@fgervais).
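A minimal sketch of such a filter, assuming OpenCV's stock haarcascade_eye.xml file is available on disk; crops in which fewer than two eyes are found would simply be skipped:

    # Sketch: reject face crops in which an OpenCV eye cascade cannot find two eyes.
    import cv2

    eye_cascade = cv2.CascadeClassifier('haarcascade_eye.xml')   # path is an assumption

    def has_both_eyes(face_bgr):
        gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
        eyes = eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        return len(eyes) >= 2

    # in the main loop, before computing the embedding:
    # if not has_both_eyes(cropped[i]):
    #     continue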

github4f commented 7 years ago

@vudung45 : Hi- I did test your code. It seems that the accuracy is not higher than the accuracy of davidsandberg's model, although you keep 3 different alignment variants (left, center, right) in your database for comparison! Did you retrain FaceNet? Do you have a new checkpoint of your own? I noted that you have only published a small version of your code; maybe that is why I see lower accuracy (e.g., when I move my head a little bit (30 degrees), your code cannot recognize me). I appreciate your comment.

One more question: could you please comment on the algorithm you use to find right, center, and left?

vudung45 commented 7 years ago

@github4f First, Thank you for checking it out. Here are my 2 cents that could possibly solve your accuracy problem.

  1. If you want to test out my code more, try inputting new data like this: GIF Demo

  2. Given the constraint of the FaceNet model's accuracy, there are many ways you can improve accuracy in a real-world application. One of my suggestions would be to create a tracker for each detected face on screen, then run recognition on each of them in real time and decide who is in each tracker after some number of frames (3-10 frames, depending on how fast your machine is). Keep doing the same thing until the tracker disappears or loses track. Your result can look somewhat like this: {"Unknown": 3, "PersonA": 1, "PersonB": 20} ---> this tracker is tracking PersonB (a rough sketch of this voting scheme follows this list). This will definitely help with your accuracy problem, because after several frames the result will most likely lean toward the right subject in the picture instead of depending on a single frame. I tried this approach for the project I worked on this summer at the company I interned at, and it worked pretty well with about 500 people in the dataset. The camera was mounted on the ceiling, pointing down at an entrance door; the server we had only had a CPU, and it was still fast enough to collect 5 frames of each face in less than 1 second. One benefit of this approach is that the longer a person stays in front of the camera, the more accurate and confident the result is, as confidence points get incremented over time.

  3. My algorithm to find right, center, and left is pretty simple: I decide it based on MTCNN's facial landmark output.
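A rough sketch of the voting scheme from point 2 (names and thresholds are made up for illustration):

    # Sketch only: per-tracker vote tally over the last few recognition results.
    from collections import Counter

    votes = Counter()                      # one Counter per tracked face

    def add_frame_result(name):            # call once per frame for this tracker
        votes[name] += 1

    def decide(min_frames=5):
        if sum(votes.values()) < min_frames:
            return None                    # not enough evidence yet
        # e.g. {"Unknown": 3, "PersonA": 1, "PersonB": 20} -> "PersonB"
        return votes.most_common(1)[0][0]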

github4f commented 7 years ago

@vudung45 Hi David- Thanks a lot for the explanation and your great ideas concerning the matter we discussed.

-I tried to use tracking techniques before to reduce the runtime issue I have faced. Here is my question on Stack Overflow regarding this issue: https://stackoverflow.com/questions/46091964/boost-face-detection-recognition-process-in-python-opencv-tensorflow-cnn

-I was wondering whether you have already implemented the tracking idea in your bigger project? Do you use OpenCV for the implementation, e.g., MIL or KCF?

-The issue with tracking (in particular the OpenCV implementation) is that sometimes (e.g., when the user moves a little fast) it gives a false result (a false rect)! At least OpenCV suffers from this issue. That is why I prefer to run face detection and recognition on every frame rather than doing detection/recognition in one frame and tracking in the following frames. I'd appreciate it if you could kindly share your experience with us. (A rough sketch of periodic re-detection is at the end of this comment.)

-Regarding the "inputting new data" that you mentioned: I am wondering whether a maximum of 3 embeddings (left, right, center) can be saved in the database for each person. Am I correct? Or is it possible, for example, to save 2 embeddings for right, 2 for center, and 2 for left for just one person?

Thank you
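On the tracking point above, one common compromise is to re-seed the tracker from a fresh detection every few frames so drift and false rects get corrected. A sketch, under the assumption that opencv-contrib 3.x provides cv2.TrackerKCF_create and with detect_one_face standing in for the MTCNN step:

    # Sketch: refresh an OpenCV KCF tracker from detection every N frames.
    import cv2

    REDETECT_EVERY = 10
    tracker = None

    for frame_idx, frame in enumerate(frames):       # `frames` = your capture loop
        if tracker is None or frame_idx % REDETECT_EVERY == 0:
            x, y, w, h = detect_one_face(frame)      # placeholder for the MTCNN step
            tracker = cv2.TrackerKCF_create()
            tracker.init(frame, (x, y, w, h))
        else:
            ok, box = tracker.update(frame)
            if not ok:
                tracker = None                       # lost track, force re-detection
            else:
                x, y, w, h = [int(v) for v in box]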

github4f commented 7 years ago

@vudung45 One more issue regarding the code. You currently calculate the similarity percentage using the following equation: percentage = min(100, 100 * thres / smallest), where thres = 0.6! I am not sure whether that equation is correct, since it depends on your own thres. I prefer to use cosine similarity, which gives a value between 0 and 1, so you can directly convert it to a percentage without requiring thres.
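For what it's worth, a minimal sketch of that cosine-based percentage, reusing scipy's distance.cosine from the snippet earlier in this thread and clipping at zero in case the similarity dips below it:

    # Sketch: report cosine similarity between two embeddings as a percentage.
    from scipy.spatial import distance

    def similarity_percent(emb_a, emb_b):
        sim = 1.0 - distance.cosine(emb_a, emb_b)    # cosine similarity
        return max(0.0, sim) * 100.0                 # clip negatives, scale to percent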

github4f commented 7 years ago

@vudung45 It seems that the following lines are also different:

    rects, landmarks = face_detect.detect_face(frame, 80);  # min face size is set to 80x80
    aligned_frame, pos = aligner.align(160, frame, landmarks[i]);

According to MTCNN, the output should look like the following:

[figure: correct — the expected alignment output]

But when I run your code, I see the following output (the region below the lips is cut off):

[figure: mine — the output I get from your code]

And when I compare it with your comment (the picture of your own face that you posted here), I see the following:

[figure: yours — the example you posted]

The region above your eyebrows is not in the picture!

-So which one is correct? -Why do you use the landmarks to detect the face? I use "rects" directly; I think rects contains the whole face, but you do some processing on the landmarks and cut off parts of the face. Am I correct? Do you have other code for alignment?

github4f commented 7 years ago

@vudung45 I also observed that your function does not work correctly:

    rects, landmarks = face_detect.detect_face(image_list[0], 80);  # min face size is set to 80x80

When I use "rects" to draw a rect around the face, it shows something completely irrelevant, but when I use davidsandberg's official alignment, it works fine. I suggest that you load an image with im = misc.imread(filename), apply your face detection to it, and draw the rect around the face to verify your function. If you have an updated function, please update the GitHub repo so we can use it. Thank you
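A minimal sketch of that check, assuming the rects come back in an (x, y, w, h) convention (if they are (x1, y1, x2, y2) instead, the drawn boxes will make that obvious):

    # Sketch: load one image, run the detector, draw the returned rects, save for inspection.
    import cv2
    from scipy import misc

    im = misc.imread(filename)                        # `filename` as in the comment above
    rects, landmarks = face_detect.detect_face(im, 80)

    for r in rects:
        x, y, w, h = [int(v) for v in r[:4]]          # assumed (x, y, w, h) layout
        cv2.rectangle(im, (x, y), (x + w, y + h), (0, 255, 0), 2)

    # misc.imread returns RGB while cv2 assumes BGR, which only swaps colors, not boxes
    cv2.imwrite('rect_check.png', im)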

vudung45 commented 7 years ago

@github4f Sorry for the lack of comments; I was busy breaking this out of my other project.

  1. My MTCNN detection is not entirely the same as davidsandberg's. I changed its data structure a little bit for my own comfort. If I remember correctly, each rect in rects is something like this: (x, y, w, h)
  2. Rect is indeed the whole face; however, my AlignCustom class uses the 5 landmarks to align the face and tries to crop it in a way that the inner eyes, nose, and outer lips sit around the "average positions of face points".
  3. I have implemented the tracking method in my other project. OpenCV trackers worked pretty decently. I used them mostly to get the predicted updated locations of the faces in the next frame, match them with the faces on screen, and then "update" the width and height of the bounding boxes, since OpenCV trackers don't do that for you.
  4. For inputting a new subject, I find the mean/center of all the input face images, categorized into 3 different face positions (left, right, center). By doing that, the more data you have, the better the center points are (a rough sketch follows this list).
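A rough sketch of point 4, assuming person_embs maps "left"/"center"/"right" to lists of that person's 128-d embeddings:

    # Sketch: one averaged embedding per pose bucket for a single person.
    import numpy as np

    centers = {pose: np.mean(embs, axis=0)
               for pose, embs in person_embs.items() if len(embs) > 0}

    # at recognition time, compare a new embedding against each pose center and keep the best match
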
github4f commented 7 years ago

@vudung45 Thanks a lot for your comments. -Yes, I use exactly the rect output of your MTCNN, based on (x, y, w, h), but it does not give correct output. The landmarks, though, are fine and great. Do you have an updated version of your MTCNN?

-I am also wondering if you have any repository for the "tracker". As you mentioned, OpenCV's trackers don't do everything for you. It would be great if you could share it with us.

Thanks

vudung45 commented 7 years ago

@github4f

Once again, thank you for checking it out :)

github4f commented 7 years ago

@vudung45 Not really; the returned rect is just wrong! See the following figure, which is the output of your code:

[figure: wrong-rect]

By the way, what kind of MTCNN do you use? For me, it sometimes detects false faces, like in the figure below:

[figure: fault — an example of a false detection]

cjs210 commented 6 years ago

I had the same issue, and I have been trying to solve it. Here are my two cents:

1) Every facial recognition algorithm is affected by changes in illumination. If you want to make your system less sensitive, your training images should cover different lighting settings so that recognition accuracy doesn't drop too drastically.

2) I have reached 85% facial recognition accuracy (90% a few times) using the cascade detector available in OpenCV for face detection. The cascade file used was lbpcascade_frontalface_improved.xml; if you want to try it, here is the link -> https://github.com/opencv/opencv/blob/master/data/lbpcascades/lbpcascade_frontalface_improved.xml (the file was trained with LBP features, but you load and run it through the same OpenCV cascade classifier interface as a Haar cascade).

It may not be as accurate as MTCNN, but it has been accurate enough for me, and you can tune the detector settings to improve accuracy if you have to.

You have to supply training data detected with the same cascade too. Make a script that detects faces with the cascade, save the images, train your system with those faces and voila, accuracy improved! At least that was my experience using FaceNet.
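A minimal sketch of that workflow (the cascade path and crop naming are assumptions):

    # Sketch: detect faces with the LBP cascade file and save the crops for training,
    # so training images and run-time images come from the same detector.
    import cv2

    cascade = cv2.CascadeClassifier('lbpcascade_frontalface_improved.xml')

    def save_face_crops(frame_bgr, prefix):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(80, 80))
        for i, (x, y, w, h) in enumerate(faces):
            cv2.imwrite('%s_%d.png' % (prefix, i), frame_bgr[y:y + h, x:x + w])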