deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai

How to tweak the parameters to get more faces? (some imgs that have faces but not detected) #68

terencezl closed this issue 6 years ago

terencezl commented 6 years ago

Is the mtcnn code from https://github.com/pangyupo/mxnet_mtcnn_face_detection? I also posted this issue there but that repo might not receive active attention anymore, so I wonder if anyone could provide any guidance here.

(image attached)

I see there are two methods, detect_face() and detect_face_limited(). The detect_face() docstring describes the input as:

            img: numpy array, bgr order of shape (1, 3, n, m)
                input image

But the actual shape checking treats an image as a [height, width, channel] array:

        # only works for color image
        if len(img.shape) != 3:
            return None

        # detected boxes
        total_boxes = []

        height, width, _ = img.shape

So if I follow the docstring and pass a (1, 3, n, m) array, the shape check fails and detect_face() returns None.
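For what it's worth, here is a minimal sketch of how I call it so the shape check passes (the model folder and image name are placeholders; I'm assuming the detector really wants the HWC BGR array that cv2.imread() gives):

import cv2
import mxnet as mx
from mtcnn_detector import MtcnnDetector   # deploy/mtcnn_detector.py

detector = MtcnnDetector(model_folder='mtcnn-model', ctx=mx.gpu(0),
                         num_worker=1, accurate_landmark=True)
img = cv2.imread('some_photo.jpg')          # shape (height, width, 3), BGR order
ret = detector.detect_face(img)
if ret is not None:
    bboxes, points = ret                    # (N, 5) boxes with scores, (N, 10) landmarks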

I also tried detect_face_limited(), saw threshold[1] and threshold[2] being used for the filtering, and decreased those values to get more results. But still, this picture for example has four faces and I'm getting only one.

detector = MtcnnDetector(model_folder='mtcnn-model', ctx=mx.gpu(0), num_worker=1, accurate_landmark = True, threshold=[0.0,-1,-1], minsize=10)
img = cv2.imread("180112-news-friends.jpg")
detector.detect_face_limited(img, 2)

(array([[  5.78239929e+02,  -3.24641838e+01,   1.59770996e+03,
           8.00984070e+02,   2.69610956e-02]], dtype=float32),
 array([[ 932, 1277, 1178,  954, 1229,  302,  275,  414,  628,  599]], dtype=int32))
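
(For reference, my reading of this return layout, judging from the output above and the deploy code:)

bbox, points = detector.detect_face_limited(img, 2)   # same call as above
# bbox:   shape (N, 5), rows are [x1, y1, x2, y2, score]
# points: shape (N, 10), rows are [x1, x2, x3, x4, x5, y1, y2, y3, y4, y5]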

How can I get all those faces?

nttstar commented 6 years ago

Face detection is not our main objective currently; you can try recent, stronger detectors.

terencezl commented 6 years ago

I see. Do these matter:

1. Relative margin (in percentage)
2. Width and height of the input face crop image
3. Alignment strategy (currently preprocessing uses landmark points from the mtcnn detector; should I do something similar with, say, dlib's alignment?)

nttstar commented 6 years ago

You can try changing parameters like scale, min-size and the thresholds of MTCNN. dlib is another solution, but I suggest you use it for detection only and then go back to MTCNN for landmark prediction.
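For example (the values are only illustrative, and I'm assuming the constructor exposes the same minsize/threshold/factor arguments as the upstream mxnet_mtcnn_face_detection repo):

detector = MtcnnDetector(model_folder='mtcnn-model',
                         ctx=mx.gpu(0),
                         num_worker=1,
                         accurate_landmark=True,
                         minsize=20,                  # smaller -> more small candidate faces
                         threshold=[0.5, 0.6, 0.7],   # P-Net / R-Net / O-Net score cut-offs
                         factor=0.709)                # image-pyramid scale step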

terencezl commented 6 years ago

Thanks. Tried min-size and thresholds. Will try scale.

I looked at it here: https://github.com/deepinsight/insightface/blob/master/deploy/mtcnn_detector.py

The landmarks are generated by the same ONet prediction that produces the final face boxes:

        output = self.ONet.predict(input_buf)
        #print(output[2])

        # filter the total_boxes with threshold
        passed = np.where(output[2][:, 1] > self.threshold[2])
        total_boxes = total_boxes[passed]

        if total_boxes.size == 0:
            return None

        total_boxes[:, 4] = output[2][passed, 1].reshape((-1,))
        reg = output[1][passed]
        points = output[0][passed]

So it seems unavoidable that the face detection has to be done by MTCNN itself in order to do proper MTCNN landmark-based alignment, doesn't it?

nttstar commented 6 years ago

No, you can crop the face with other face detectors, then use the RNet of MTCNN to predict landmarks. MTCNN sometimes fails for profile faces (profile faces are always harder).
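Roughly like this (a sketch; I'm assuming detect_face_limited() in deploy/mtcnn_detector.py accepts an already-cropped BGR patch plus a det_type argument, as the deploy code suggests):

# face crop from any other detector (dlib etc.); pad it a little so the
# landmark stage sees some context around the face
face_crop = img[y1:y2, x1:x2]
ret = detector.detect_face_limited(face_crop, det_type=2)   # skip P-Net, refine on the crop
if ret is not None:
    bbox, points = ret   # landmark coordinates are relative to the crop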

terencezl commented 6 years ago

Hi. I tried changing the image pyramid scale to other values and it didn't help.

Meanwhile, I found this tensorflow implementation: https://pypi.python.org/pypi/mtcnn, which returns a list of dicts in the following format:

[{'box': [107, 90, 145, 198],
 'confidence': 0.99958628416061401,
 'keypoints': {'left_eye': (128, 157),
  'mouth_left': (130, 243),
  'mouth_right': (175, 243),
  'nose': (134, 198),
  'right_eye': (185, 159)}}]

I was able to write a little conversion function to generate the (bbox, points) tuple returned by detect_face_limited() in https://github.com/deepinsight/insightface/blob/master/deploy/mtcnn_detector.py .

import numpy as np

def mtcnn_tf_to_mx(res_tf):
    bbox_list = []
    points_list = []

    if not res_tf:
        return None

    for r in res_tf:
        bbox = r['box']
        lms = r['keypoints']
        bbox = [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3], r['confidence']]

        points = []
        for i in ['left_eye', 'right_eye', 'nose', 'mouth_left', 'mouth_right']:
            points.append(lms[i][0])
            points.append(lms[i][1])

        bbox_list.append(bbox)
        points_list.append(points)

    res_mx = (np.array(bbox_list), np.array(points_list))
    return res_mx
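
For completeness, this is roughly how I produce res_tf and convert it (a sketch assuming the pypi mtcnn package's MTCNN().detect_faces() API, which takes an RGB array):

import cv2
from mtcnn.mtcnn import MTCNN   # pip install mtcnn (the tensorflow implementation above)

tf_detector = MTCNN()
img_bgr = cv2.imread('180112-news-friends.jpg')
res_tf = tf_detector.detect_faces(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB))
ret = mtcnn_tf_to_mx(res_tf)    # None if nothing was detected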

Output:

(array([[ 107.        ,   90.        ,  252.        ,  288.        ,
            0.99958628]]),
 array([[128, 157, 185, 159, 134, 198, 130, 243, 175, 243]]))

terencezl commented 6 years ago

Sorry, I submitted that comment prematurely.

I feed the returned tuple into https://github.com/deepinsight/insightface/blob/master/deploy/face_embedding.py#L65 .

I then tried to see whether this detector, which is also MTCNN but a different implementation (TensorFlow) and possibly trained on a different set, changes my identity comparisons. It turns out the accuracy dropped a lot!

Again, the mxnet MTCNN detects fewer faces than the tf implementation. The distance threshold is 1.24.

# with mxnet MTCNN: all pairs are correctly distinguished.
The distance between nm3081796_rm2230957824_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.4423997402191162, same person: False.
The distance between nm3081796_rm2230957824_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.4479265213012695, same person: False.
The distance between nm3081796_rm3459500800_1983-12-21_2014.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.4347110986709595, same person: False.
The distance between nm3081796_rm3459500800_1983-12-21_2014.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.408543586730957, same person: False.
The distance between nm3081796_rm3914565376_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.4561747312545776, same person: False.
The distance between nm3081796_rm3914565376_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.4844081401824951, same person: False.
The distance between nm3081796_rm666675968_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.4188538789749146, same person: False.
The distance between nm3081796_rm666675968_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.4829577207565308, same person: False.

# with tf MTCNN: more faces but not all pairs are correctly distinguished.
The distance between nm3081796_rm2230957824_1983-12-21_2010.jpg and nm0510912_rm1150527232_1973-0-0_2013.jpg is 1.281417965888977, same person: False.
The distance between nm3081796_rm2230957824_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.1403337717056274, same person: True.
The distance between nm3081796_rm2230957824_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.1784759759902954, same person: True.
The distance between nm3081796_rm3459500800_1983-12-21_2014.jpg and nm0510912_rm1150527232_1973-0-0_2013.jpg is 1.3501794338226318, same person: False.
The distance between nm3081796_rm3459500800_1983-12-21_2014.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.1472257375717163, same person: True.
The distance between nm3081796_rm3459500800_1983-12-21_2014.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.2627432346343994, same person: False.
The distance between nm3081796_rm3914565376_1983-12-21_2010.jpg and nm0510912_rm1150527232_1973-0-0_2013.jpg is 1.2442837953567505, same person: False.
The distance between nm3081796_rm3914565376_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.0617015361785889, same person: True.
The distance between nm3081796_rm3914565376_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.1664326190948486, same person: True.
The distance between nm3081796_rm666675968_1983-12-21_2010.jpg and nm0510912_rm1150527232_1973-0-0_2013.jpg is 1.2033170461654663, same person: True.
The distance between nm3081796_rm666675968_1983-12-21_2010.jpg and nm0510912_rm1621990144_1973-0-0_2006.jpg is 1.107056975364685, same person: True.
The distance between nm3081796_rm666675968_1983-12-21_2010.jpg and nm0510912_rm198946816_1973-0-0_2011.jpg is 1.1129837036132812, same person: True.

This is only one of the sets I tested on. Some cases are more extreme: all distances between two different identities fall below the 1.24 threshold.
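(For clarity, the comparison itself is, as I understand it, just the Euclidean distance between the two L2-normalised embeddings returned by face_embedding.py:)

import numpy as np

dist = np.linalg.norm(f1 - f2)    # f1, f2: the two embedding vectors
same_person = dist < 1.24         # the threshold used above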

I read that

Note that we do not require the input face image to be aligned but it should be cropped. We use (RNet+)ONet of MTCNN to further align the image before sending it to recognition network.

But it seems even a little misalignment can cause a lot of trouble.

If I could use the included mxnet MTCNN implementation to get the bboxes and landmarks for faces that are only detectable by other MTCNN implementations or other detectors (dlib), it would no doubt solve this problem.

You also noted that

Align all face images of facescrub dataset and megaface distractors. Please check the alignment scripts under $INSIGHTFACE_ROOT/src/align/. (We may plan to release these data soon, not sure.)

With the included mtcnn under /deploy, how were you able to detect all the faces used for training and the MegaFace competition? Is there a different mtcnn implementation, with the same parameters, somewhere in src/align? I couldn't tell, as there are many similar scripts.

terencezl commented 6 years ago

No, you can crop the face with other face detectors, then use the RNet of MTCNN to predict landmarks. MTCNN sometimes fails for profile faces (profile faces are always harder).

It seems that in https://github.com/deepinsight/insightface/blob/master/deploy/mtcnn_detector.py#L409, RNet does not give landmarks:

          output = self.RNet.predict(input_buf)

          # filter the total_boxes with threshold
          passed = np.where(output[1][:, 1] > self.threshold[1])
          total_boxes = total_boxes[passed]

          if total_boxes.size == 0:
              return None

          total_boxes[:, 4] = output[1][passed, 1].reshape((-1,))
          reg = output[0][passed]

Instead, ONet does that. Also, Fig. 1 (and its caption) of https://kpzhang93.github.io/MTCNN_face_detection_alignment/paper/spl.pdf says:

After that, we refine these candidates in the next stage through a Refinement Network (R-Net). In the third stage, The Output Network (O-Net) produces final bounding box and facial landmarks position.

        output = self.ONet.predict(input_buf)
        #print(output[2])

        # filter the total_boxes with threshold
        passed = np.where(output[2][:, 1] > self.threshold[2])
        total_boxes = total_boxes[passed]

        if total_boxes.size == 0:
            return None

        total_boxes[:, 4] = output[2][passed, 1].reshape((-1,))
        reg = output[1][passed]
        points = output[0][passed]

But again, the code in mtcnn_detector.py is hard to read... I know detection and preprocessing are not strictly your concern here, but judging by my tests, they are more intertwined than not. I would really appreciate it if you could provide an alternative to the current mxnet MTCNN implementation, or some hybrid approach, so that I can use the same detection bbox and landmark method (as used for the training sets) before alignment and network forwarding.

Thanks!

nttstar commented 6 years ago

@terencezl Yes, ONet produces landmarks, not RNet. Different implementations are bound to have some differences in their outputs. The code under /deploy is just for reference, and the included mxnet MTCNN is used to detect landmarks from your input images. In the MTCNN pipeline, the input to RNet and ONet is a cropped face patch, so you need to run some specific face detector before sending the image to the face_embedding class (using MTCNN as that detector is also OK).
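For example, something like this (just a sketch; the face_preprocess.preprocess helper and its signature are assumed from how the deploy code calls it):

import face_preprocess   # src/common/face_preprocess.py (import path assumed)

# face_crop: BGR patch produced by whatever detector you prefer
ret = detector.detect_face_limited(face_crop, det_type=2)
if ret is not None:
    bbox, points = ret
    b = bbox[0, 0:4]
    p = points[0, :].reshape((2, 5)).T            # five (x, y) landmark pairs
    aligned = face_preprocess.preprocess(face_crop, b, p, image_size='112,112')
    # 'aligned' is the 112x112 image that is fed to the recognition network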

Wisgon commented 6 years ago

I use mtcnn for caffe (https://github.com/DuinoDu/mtcnn) to detect two more faces in a picture, and pyseeta (https://github.com/TuXiaokang/pyseeta) to align faces, and the result is much better. You can try it; you just need to modify a little code in face_embedding.py.

terencezl commented 6 years ago

I see. I tried a different MTCNN implementation, https://github.com/ipazc/mtcnn, which is a rewrite of facenet's (https://github.com/davidsandberg/facenet/tree/master/src/align; the model file contents should be the same), to get the landmarks (points) and used them to do alignment, but the result is not as good as with the mxnet MTCNN, when the latter is able to detect faces.

In addition, assuming the landmarks are close enough: I see the code in src/align includes David Sandberg's facenet implementation (https://github.com/davidsandberg/facenet/tree/master/src/align). Did you use this detector/landmark detector together with his alignment strategy (https://github.com/deepinsight/insightface/blob/master/src/align/align_dataset_mtcnn.py), or did you use the detectors but carry out your own alignment, as in

https://github.com/deepinsight/insightface/blob/master/src/align/align_insight.py
https://github.com/deepinsight/insightface/blob/master/src/align/align_megaface.py
https://github.com/deepinsight/insightface/blob/master/src/align/align_facescrub.py

?

terencezl commented 6 years ago

I made a mistake during the conversion: the mxnet detector lays out the 10 landmark values as all five x coordinates followed by all five y coordinates, whereas my function interleaved (x, y) pairs. With the order fixed, the accuracy is good. Thanks.

        # append the five x coordinates first, then the five y coordinates
        for i in ['left_eye', 'right_eye', 'nose', 'mouth_left', 'mouth_right']:
            points.append(lms[i][0])
        for i in ['left_eye', 'right_eye', 'nose', 'mouth_left', 'mouth_right']:
            points.append(lms[i][1])

Edwardmark commented 6 years ago

@terencezl What mistake did you make? Can you tell us? I think the MTCNN detector here is not good; many frontal faces cannot be detected.