Face detection is not our main objective currently; you can try more recent, stronger detectors.
I see. Do the 1. relative margin (in percentage)
You can try to change the parameters like scale, min-size and thresholds of MTCNN. dlib is another solution, but I suggest you use it for detection only, then go back to MTCNN for landmark prediction.
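For example, something along these lines (parameter names as in deploy/mtcnn_detector.py; the exact values here are only illustrative and the model folder path is a placeholder):

import cv2
import mxnet as mx
from mtcnn_detector import MtcnnDetector

# Looser settings than the usual defaults (minsize=20, threshold=[0.6, 0.7, 0.8],
# factor=0.709): more candidate faces survive, at the cost of more false positives
# and a slower image pyramid.
detector = MtcnnDetector(model_folder='./mtcnn-model',
                         minsize=12,                 # accept smaller faces
                         threshold=[0.5, 0.6, 0.7],  # lower per-stage score cutoffs
                         factor=0.85,                # denser image pyramid (scale step)
                         ctx=mx.cpu())

img = cv2.imread('test.jpg')
ret = detector.detect_face(img)  # (total_boxes, points), or None if nothing passes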
Thanks. Tried min-size and thresholds. Will try scale.
I looked at it here: https://github.com/deepinsight/insightface/blob/master/deploy/mtcnn_detector.py
The landmarks are generated by the same ONet operation that produces the final faces:
output = self.ONet.predict(input_buf)
#print(output[2])
# filter the total_boxes with threshold
passed = np.where(output[2][:, 1] > self.threshold[2])
total_boxes = total_boxes[passed]
if total_boxes.size == 0:
    return None
total_boxes[:, 4] = output[2][passed, 1].reshape((-1,))
reg = output[1][passed]
points = output[0][passed]
So it seems really unavoidable that the face detection itself has to be done by MTCNN in order to do proper MTCNN landmark-based alignment, doesn't it?
No, you can crop the face with other face detectors, then use RNet of MTCNN to predict landmarks. MTCNN sometimes fails for profile faces (profile faces are always harder).
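Roughly like this (a sketch only, with dlib standing in for any external detector; the exact detect_face_limited() arguments in deploy/mtcnn_detector.py may differ, but the idea is to hand it an already-cropped face patch and let MTCNN's later stages predict the five landmarks):

import cv2
import dlib
import mxnet as mx
from mtcnn_detector import MtcnnDetector

# The external detector (here dlib's HOG frontal-face detector) finds the boxes;
# MTCNN is only asked for landmarks on each cropped patch.
face_detector = dlib.get_frontal_face_detector()
mtcnn = MtcnnDetector(model_folder='./mtcnn-model', ctx=mx.cpu())

img = cv2.imread('group.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for rect in face_detector(gray, 1):
    # Clamp the dlib rectangle to the image and crop the face patch.
    x1, y1 = max(rect.left(), 0), max(rect.top(), 0)
    x2, y2 = min(rect.right(), img.shape[1]), min(rect.bottom(), img.shape[0])
    face_patch = img[y1:y2, x1:x2]

    # Refined box and five landmarks come back in patch coordinates.
    ret = mtcnn.detect_face_limited(face_patch)
    if ret is None:
        continue
    bbox, points = ret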
Hi. I tried changing the image pyramid scale to other values and it didn't help.
Meanwhile I found this TensorFlow implementation: https://pypi.python.org/pypi/mtcnn, which returns a list of dicts in the following format:
[{'box': [107, 90, 145, 198],
'confidence': 0.99958628416061401,
'keypoints': {'left_eye': (128, 157),
'mouth_left': (130, 243),
'mouth_right': (175, 243),
'nose': (134, 198),
'right_eye': (185, 159)}}]
I was able to create a little conversion function to generate the (bbox, points) tuple returned by detect_face_limited() in https://github.com/deepinsight/insightface/blob/master/deploy/mtcnn_detector.py.
import numpy as np

def mtcnn_tf_to_mx(res_tf):
    """Convert the tf MTCNN's list-of-dicts output into the (bbox, points)
    tuple format produced by the mxnet mtcnn_detector."""
    bbox_list = []
    points_list = []
    if not res_tf:
        return None
    for r in res_tf:
        bbox = r['box']
        lms = r['keypoints']
        # [x, y, w, h] -> [x1, y1, x2, y2, score]
        bbox = [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3], r['confidence']]
        points = []
        for i in ['left_eye', 'right_eye', 'nose', 'mouth_left', 'mouth_right']:
            points.append(lms[i][0])
            points.append(lms[i][1])
        bbox_list.append(bbox)
        points_list.append(points)
    return (np.array(bbox_list), np.array(points_list))
Output:
(array([[ 107. , 90. , 252. , 288. , 0.99958628]]),
array([[128, 157, 185, 159, 134, 198, 130, 243, 175, 243]]))
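For completeness, this is roughly how I call it (the pip mtcnn package's MTCNN().detect_faces() expects an RGB array; the file name is just a placeholder):

import cv2
from mtcnn.mtcnn import MTCNN

detector_tf = MTCNN()

# OpenCV loads BGR; the pip mtcnn package works on RGB arrays.
img = cv2.cvtColor(cv2.imread('face.jpg'), cv2.COLOR_BGR2RGB)

res_tf = detector_tf.detect_faces(img)  # list of dicts as shown above
res_mx = mtcnn_tf_to_mx(res_tf)         # (bbox_array, points_array) or None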
Sorry, the previous comment went out prematurely.
I feed the returned tuple into https://github.com/deepinsight/insightface/blob/master/deploy/face_embedding.py#L65.
I then tried to see whether this method, still MTCNN but a different implementation (tf) and possibly a different training set, alters my identity comparisons or not. It turns out the accuracy dropped a lot!
Again, the mxnet MTCNN detects fewer faces than the tf implementation; the distance threshold is 1.24.
# with mxnet MTCNN: all pairs are correctly distinguished.
The distance between nm3081796_rm2230957824_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.4423997402191162, same person: False.
The distance between nm3081796_rm2230957824_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.4479265213012695, same person: False.
The distance between nm3081796_rm3459500800_1983-12-21_2014.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.4347110986709595, same person: False.
The distance between nm3081796_rm3459500800_1983-12-21_2014.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.408543586730957, same person: False.
The distance between nm3081796_rm3914565376_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.4561747312545776, same person: False.
The distance between nm3081796_rm3914565376_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.4844081401824951, same person: False.
The distance between nm3081796_rm666675968_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.4188538789749146, same person: False.
The distance between nm3081796_rm666675968_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.4829577207565308, same person: False.
# with tf MTCNN: more faces but not all pairs are correctly distinguished.
The distance between nm3081796_rm2230957824_1983-12-21_2010.jpg and nm0510912_rm1150527232_1973-0-0_2013.jpg is 1.281417965888977, same person: False.
The distance between nm3081796_rm2230957824_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.1403337717056274, same person: True.
The distance between nm3081796_rm2230957824_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.1784759759902954, same person: True.
The distance between nm3081796_rm3459500800_1983-12-21_2014.jpg and nm0510912_rm1150527232_1973-0-0_2013.jpg is 1.3501794338226318, same person: False.
The distance between nm3081796_rm3459500800_1983-12-21_2014.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.1472257375717163, same person: True.
The distance between nm3081796_rm3459500800_1983-12-21_2014.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.2627432346343994, same person: False.
The distance between nm3081796_rm3914565376_1983-12-21_2010.jpg and nm0510912_rm1150527232_1973-0-0_2013.jpg is 1.2442837953567505, same person: False.
The distance between nm3081796_rm3914565376_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.0617015361785889, same person: True.
The distance between nm3081796_rm3914565376_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.1664326190948486, same person: True.
The distance between nm3081796_rm666675968_1983-12-21_2010.jpg and nm0510912_rm1150527232_1973-0-0_2013.jpg is 1.2033170461654663, same person: True.
The distance between nm3081796_rm666675968_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.107056975364685, same person: True.
The distance between nm3081796_rm666675968_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.1129837036132812, same person: True.
This is only one of the sets I tested on. Some cases are more extreme, where all distances between two different identities fall below the 1.24 threshold.
I read that
Note that we do not require the input face image to be aligned but it should be cropped. We use (RNet+)ONet of MTCNN to further align the image before sending it to recognition network.
But it seems even a little misalignment causes a lot of trouble.
If I could use the included mtcnn mxnet implementation to get the bboxes and landmarks for the faces that are detectable by other mtcnn implementations and other detectors (dlib), it would no doubt solve this problem.
You also noted that
Align all face images of facescrub dataset and megaface distractors. Please check the alignment scripts under $INSIGHTFACE_ROOT/src/align/. (We may plan to release these data soon, not sure.)
With the included mtcnn under /deploy, how were you able to detect all those faces used for training and the MegaFace competition? Is there a different mtcnn implementation with the same parameters somewhere in src/align? I couldn't find it, as there are many similar choices.
No, you can crop the face with other face detectors, then use RNet of MTCNN to predict landmarks. MTCNN sometimes fails for profile faces (profile faces are always harder).
It seems that in https://github.com/deepinsight/insightface/blob/master/deploy/mtcnn_detector.py#L409, RNet does not give landmarks.
output = self.RNet.predict(input_buf)
# filter the total_boxes with threshold
passed = np.where(output[1][:, 1] > self.threshold[1])
total_boxes = total_boxes[passed]
if total_boxes.size == 0:
    return None
total_boxes[:, 4] = output[1][passed, 1].reshape((-1,))
reg = output[0][passed]
Instead, ONet does that. Also, Fig. 1 (along with its caption) of https://kpzhang93.github.io/MTCNN_face_detection_alignment/paper/spl.pdf says
After that, we refine these candidates in the next stage through a Refinement Network (R-Net). In the third stage, the Output Network (O-Net) produces the final bounding box and facial landmark positions.
output = self.ONet.predict(input_buf)
#print(output[2])
# filter the total_boxes with threshold
passed = np.where(output[2][:, 1] > self.threshold[2])
total_boxes = total_boxes[passed]
if total_boxes.size == 0:
    return None
total_boxes[:, 4] = output[2][passed, 1].reshape((-1,))
reg = output[1][passed]
points = output[0][passed]
But again, the code in mtcnn_detector.py is hard to read... I know detection and preprocessing are not strictly your concern here, but judging by my tests, they seem more intertwined than not. I would really appreciate it if you could provide an alternative to the current mtcnn mxnet implementation, or some hybrid approach, so that the same detection bboxes and landmark method (as used for the training sets) can be applied before alignment and nnet forwarding.
Thanks!
@terencezl Yes ONet produces landmarks, not RNet.
Different implementations must have some differences in their outputs. The code under /deploy is just for reference, and the included mxnet MTCNN is used to detect landmarks from your input images. In the MTCNN working pipeline, the inputs of RNet and ONet are both cropped face image patches, so you need to use one specific face detector before sending images to the face_embedding class (MTCNN as the detector is also OK).
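In other words, something like this sketch (it assumes the preprocess(img, bbox, landmark, image_size=...) helper from src/common/face_preprocess.py and that the detector's points row is laid out as [x1..x5, y1..y5]; align_from_detection is just a hypothetical helper name):

import numpy as np
import face_preprocess  # src/common/face_preprocess.py

def align_from_detection(face_img, bbox_row, points_row):
    # bbox_row: [x1, y1, x2, y2, score]; points_row: [x1..x5, y1..y5].
    bbox = np.asarray(bbox_row)[0:4]
    landmark = np.asarray(points_row).reshape((2, 5)).T  # -> (5, 2) (x, y) pairs
    # Similarity-align the crop to the canonical 112x112 template before
    # sending it to the recognition network.
    return face_preprocess.preprocess(face_img, bbox, landmark, image_size='112,112')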
I use the Caffe MTCNN (https://github.com/DuinoDu/mtcnn), which detects two more faces in a picture, and pyseeta (https://github.com/TuXiaokang/pyseeta) to align the faces, and the result is better. You can try it; you just need to modify a little code in face_embedding.py.
I see. I tried a different MTCNN implementation, https://github.com/ipazc/mtcnn, which is a rewrite of facenet's (https://github.com/davidsandberg/facenet/tree/master/src/align; the model file contents should be the same), to get the landmarks (points), and used them to do alignment, but the result is not as good as with the mxnet MTCNN, when the latter is able to detect faces.
In addition, assuming the landmarks are close enough: I see the code in src/align includes David Sandberg's facenet implementation (https://github.com/davidsandberg/facenet/tree/master/src/align). Did you use this detector/landmark detector together with his alignment strategy (https://github.com/deepinsight/insightface/blob/master/src/align/align_dataset_mtcnn.py), or did you use the detectors but carry out your own alignment as in https://github.com/deepinsight/insightface/blob/master/src/align/align_insight.py, https://github.com/deepinsight/insightface/blob/master/src/align/align_megaface.py and https://github.com/deepinsight/insightface/blob/master/src/align/align_facescrub.py?
I made a mistake during conversion. Now the accuracy is good. Thanks.
for i in ['left_eye', 'right_eye', 'nose', 'mouth_left', 'mouth_right']:
points.append(lms[i][0])
points.append(lms[i][1])
@terencezl What mistake did you make? Can you tell us? I think the MTCNN detector here is not good; many frontal faces cannot be detected.
Is the mtcnn code from https://github.com/pangyupo/mxnet_mtcnn_face_detection? I also posted this issue there, but that repo might not receive active attention anymore, so I wonder if anyone could provide guidance here.
I see there are two methods, detect_face() and detect_face_limited(). The detect_face() signature describes the expected input, but the actual shape checking treats an image as a [height, width, channel] array, so the detection fails at some point.
I also tried detect_face_limited(), saw threshold[1] and threshold[2] being used for the filtering, and decreased those values to get more results. But still, this picture, for example, has four faces, and I'm getting only one. How can I get all those faces?