Hi @melgor, I will try to reply to all your questions. I think there is some confusion and I hope this clears things up.
In this question, I do not understand if you refer to the testing or the training part of FPN. I will try to describe the training part. For FPN training, we used an off-the-shelf landmark detector (OpenFace [3]) to estimate the 6DoF pose for each input image in the training set, given a generic 3D model with labeled 3D landmarks in correspondence with the 2D ones. A novel part of the work is that we augment the set with a simple 2D similarity transformation to generate very tough samples, on which standard landmark detectors are likely to fail (Fig. 4 in the paper). In our case, since we know the transformation parameters (we generated them), we can map the pose from the original, "easy" input image onto the perturbed image, getting a new "labeled" pose for free. Note that in this process we did not make use of 3D augmentation [A] when training FPN. 3D augmentation to generate multiple views was used when training the recognition network, but that is another paper [B,C].
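To make that labeling step concrete, here is a minimal Python sketch of estimating a 6DoF pose from detected 2D landmarks and the corresponding 3D landmarks of a generic model, using OpenCV's `solvePnP`. The solver choice, the pinhole camera approximation, and the names `detected_landmarks_2d` / `model_landmarks_3d` are my assumptions, not details taken from the paper.

```python
import numpy as np
import cv2

def estimate_6dof(detected_landmarks_2d, model_landmarks_3d, img_w, img_h):
    """Estimate a 6DoF pose (Rodrigues rotation vector + translation) from 2D-3D landmark pairs."""
    # Simple pinhole approximation: focal length ~ image width,
    # principal point at the image center, no lens distortion.
    focal = float(img_w)
    camera_matrix = np.array([[focal, 0.0, img_w / 2.0],
                              [0.0, focal, img_h / 2.0],
                              [0.0, 0.0, 1.0]], dtype=np.float64)
    dist_coeffs = np.zeros(4)

    ok, rvec, tvec = cv2.solvePnP(
        model_landmarks_3d.astype(np.float64),     # Nx3 landmarks on the generic 3D model
        detected_landmarks_2d.astype(np.float64),  # Nx2 landmarks detected in the image
        camera_matrix, dist_coeffs,
        flags=cv2.SOLVEPNP_EPNP)
    return rvec, tvec  # together: the 6DoF pose label for this image
```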
See reply above. We used OpenFace [3] but other methods can work as well. In order to make the method robust, you need to perform 2D augmentation as described in the paper.
When training the FPN model, the input images are always 224x224. Our 2D augmentation for FPN training is stochastic; that is, we randomly sample the transformation parameters by varying rotation angle, scale, and translation. So at training time the faces "are always moving in the 224x224 support". The faces are jittered with a 2D similarity transformation (s, R, t) plus some amount of blur to simulate low-resolution faces in videos. At test time, we fix the crop to the face so that it roughly contains the entire head in the 224x224 input image.
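A rough sketch of that stochastic jitter, assuming OpenCV and NumPy, might look like the following. The parameter ranges, blur kernel, and function names are illustrative, not the values used in the paper; the point is that the same 2x3 matrix that warps the crop also maps the landmarks, which is what carries the "easy" pose label over to the perturbed image.

```python
import numpy as np
import cv2

def jitter_face(img224, landmarks_2d, rng=np.random):
    """Apply a random 2D similarity transform (s, R, t) and optional blur to a 224x224 face crop."""
    h, w = img224.shape[:2]
    s = rng.uniform(0.75, 1.25)            # random scale (illustrative range)
    angle = rng.uniform(-30, 30)           # random in-plane rotation, degrees
    tx, ty = rng.uniform(-20, 20, size=2)  # random translation, pixels

    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, s)  # 2x3 similarity matrix
    M[:, 2] += (tx, ty)

    warped = cv2.warpAffine(img224, M, (w, h), borderMode=cv2.BORDER_REPLICATE)
    if rng.rand() < 0.5:                   # occasionally blur to mimic low-resolution video frames
        warped = cv2.GaussianBlur(warped, (5, 5), 0)

    # Map the landmarks (and hence the original pose label) onto the jittered image.
    pts = np.hstack([landmarks_2d, np.ones((len(landmarks_2d), 1))])
    warped_landmarks = pts @ M.T
    return warped, warped_landmarks
```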
The testing pipeline for recognition is fairly complex and does not use only 3D renderings; it combines both 2D and 3D alignments. Images are processed with FPN to get the 6DoF pose, then we render new views in 3D following [A,B,C]. Note that in the rendering, if faces are far from frontal, we avoid frontalizing them. So basically we augment with new 3D views only near the input pose. Moreover, using the pose estimated by FPN, we also compute a 2D similarity transformation to align the images in 2D for recognition. A note: the reference points for the 2D similarity transform are different for frontal and profile faces. Frontal faces are aligned with canonical points on the eyes, nose, and mouth; for profile faces, we use the visible eye and the tip of the nose. When all the images are aligned (2D+3D) with FPN as mentioned above, they are fed into the recognition network [C] and their features are pooled with averaging and some other tricks (PCA + power normalization) into a single compact descriptor. Experimentally we observed that most of the recognition power is in the 2D images, but the 3D views also improve results when performing the descriptor pooling. For more info on this, check [C].
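For what it's worth, a hedged sketch of that pose-dependent 2D similarity alignment could look like the code below. The canonical reference coordinates, the 5-point landmark layout, the yaw sign convention, and the yaw threshold are placeholders I made up; only the idea (frontal faces aligned on eyes/nose/mouth, near-profile faces on the visible eye and nose tip) comes from the thread.

```python
import numpy as np
import cv2

# Hypothetical canonical target points in a 224x224 aligned crop.
FRONTAL_REF = np.float32([[70, 90], [154, 90], [112, 130], [85, 170], [139, 170]])  # eyes, nose, mouth corners
PROFILE_REF = np.float32([[90, 90], [140, 130]])                                    # visible eye, nose tip

def align_2d(img, pts5, yaw_deg, out_size=224):
    """Align a face with a 2D similarity transform, using different reference points for frontal vs. profile."""
    # pts5 is assumed to follow a 5-point layout: [left eye, right eye, nose tip,
    # left mouth corner, right mouth corner]; indexing and yaw sign are illustrative only.
    near_profile = abs(yaw_deg) > 60                 # illustrative yaw threshold
    if near_profile:
        visible_eye = pts5[0] if yaw_deg > 0 else pts5[1]
        src = np.float32([visible_eye, pts5[2]])     # visible eye + nose tip
        dst = PROFILE_REF
    else:
        src, dst = np.float32(pts5), FRONTAL_REF
    M, _ = cv2.estimateAffinePartial2D(src, dst)     # least-squares similarity (s, R, t)
    return cv2.warpAffine(img, M, (out_size, out_size))
```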
I hope this helps.
[3] https://github.com/TadasBaltrusaitis/OpenFace
[A] https://github.com/iacopomasi/face_specific_augm
[B] Masi et al., "Do We Really Need to Collect Millions of Faces for Effective Face Recognition?", ECCV 2016
[C] Masi et al., "Rapid Synthesis of Massive Face Sets for Improved Face Recognition", FG 2017
Thanks for the answer, it will definitely help me understand the whole process!
@iacopomasi
One more question about face recognition. As you mentioned:
Note that in the rendering, if faces are far from frontal, we avoid frontalizing them. So basically we augment with new 3D views only near the input pose.
So is the result of the 3D alignment only the frontalized face, not multiple views of the input? If the input is not near frontal, do you skip the 3D alignment?
Moreover, after the 3D alignment, how do you get the five 2D landmarks?
By the way, in most face recognition papers they only do 2D alignment, but in your paper I didn't see a performance gain from the additional 3D alignment.
So is the result of the 3D alignment only the frontalized face, not multiple views of the input?
It is both; that is, multiple views.
if the input is not near frontal, do you skip 3D alignment?
No, we don't. If the input is near-profile, we do not frontalize it.
moreover, after the 3D alignment, how do you get the five 2D landmarks?
By definition, if you do 3D alignment, the faces are already aligned to a 3D shape, up to any possible alignment errors (so the landmarks should always be in the same coordinate system for each rendered view).
By the way, in most face recognition papers they only do 2D alignment, but in your paper I didn't see a performance gain from the additional 3D alignment.
Yes, this is true and it is a trade-off of the method. In our case, we got an improvement by feeding both 2D-aligned images and 3D-rendered images and averaging the feature vectors afterwards. We got a further boost in results by applying PCA and signed square rooting, as usually done in Fisher-Vector representations.
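As an illustration of that pooling, here is a minimal sketch. It assumes `features` is an (N, D) matrix of CNN descriptors extracted from the 2D-aligned images and 3D-rendered views of the same subject, and that a scikit-learn PCA model has been fitted beforehand on separate data; the final L2 normalization is a common companion of power normalization in Fisher-Vector-style pipelines rather than something stated in this thread.

```python
import numpy as np
from sklearn.decomposition import PCA

def pool_descriptors(features, pca):
    """Pool per-view CNN descriptors into a single compact face descriptor."""
    pooled = features.mean(axis=0, keepdims=True)       # element-wise average over all 2D/3D views
    pooled = pca.transform(pooled)                       # project onto the PCA subspace (pca fitted beforehand)
    pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))   # signed square rooting (power normalization)
    pooled /= np.linalg.norm(pooled) + 1e-12              # L2-normalize the final descriptor (assumed step)
    return pooled.ravel()
```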
Hi. Could you give some more details on how you prepared the data?