Sort of, I'm planning to write a full explanation soon, but just to give the general idea 💡. Putting unsupervised training aside, methods mainly use one of the following:
- Supervised training on synthetic data, where exact ground-truth geometry is available for every render.
- Training on real data with self-supervised rendering (photometric) losses, since real photographs come without exact geometry labels.
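To make the distinction concrete, here is a rough sketch of the two supervision signals (the tensor names, the `render` callable and the L1 choice are illustrative assumptions, not this repo's actual training code):

```python
import torch.nn.functional as F

# pred_geometry : network output, e.g. a depth / correspondence map (B, C, H, W)
# gt_geometry   : exact labels, only available for synthetic renders
# real_image    : an unlabeled real photograph (B, 3, H, W)
# render        : some differentiable renderer turning predicted geometry into an image

def synthetic_supervised_loss(pred_geometry, gt_geometry):
    # Option 1: direct regression against exact synthetic ground truth.
    return F.l1_loss(pred_geometry, gt_geometry)

def real_rendering_loss(pred_geometry, real_image, render):
    # Option 2: no labels; re-render the prediction and compare it to the
    # input photograph (a rendering / photometric loss).
    return F.l1_loss(render(pred_geometry), real_image)
```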
When comparing the two data domains we consistently saw that training with synthetic data led to more accurate reconstruction of facial details, but resulted in noise around the edges and on occlusions.
To try and solve that we played a bit with generating more realistic synthetic data. Specifically, we placed the synthetic faces on real people and blended occlusions back in. That way we can still utilize the accuracy of synthetic data with more realistic images.
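The compositing itself is essentially alpha blending with a face mask; something along these lines (a simplified sketch, where the mask source, the feathering and the rest of the data pipeline are assumptions on my part):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blend_synthetic_onto_real(synthetic_face, real_image, face_mask, feather_sigma=3.0):
    """Composite a rendered synthetic face onto a real photograph.

    synthetic_face : float array (H, W, 3), the synthetic render
    real_image     : float array (H, W, 3), a real photo (background, hair, occluders)
    face_mask      : float array (H, W, 1) in [0, 1], 1 where the synthetic face is valid
    """
    # Soften the mask edges so the composite has no hard seams.
    soft_mask = gaussian_filter(face_mask, sigma=(feather_sigma, feather_sigma, 0))
    # Keep the synthetic face where the mask is 1; let the real image
    # (and its occlusions) show through where it is 0.
    return soft_mask * synthetic_face + (1.0 - soft_mask) * real_image
```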
While using the blended data resulted in some additional robustness, we found that it does not fully mimic real data and still results in some artifacts. This made us decide to try to utilize the real data directly to get extra robustness to those patterns.
While using rendering losses is a great way to learn geometry from real data without labels, we realized that what we really want from real data is only to learn the face region, not the geometry. To achieve that we simply added an auxiliary task of predicting a segmentation map for the head region (0/1 values). While the geometry maps were trained only on synthetic images, the segmentation task was trained on both synthetic and real data fitted with a simple 3DMM, allowing the model to see many real images during training. This simple scheme was able to reduce most of the artifacts we had encountered 🙃.
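In code the scheme boils down to a masked multi-task loss, roughly like this (again a sketch with made-up names; the actual losses and weights used for training may differ):

```python
import torch
import torch.nn.functional as F

def multitask_loss(pred_geometry, pred_seg_logits, gt_geometry, gt_seg, is_synthetic):
    """is_synthetic : (B,) bool tensor, True for synthetic samples.
    gt_seg          : (B, 1, H, W) float mask in {0, 1}, from the render for synthetic
                      images or from a simple 3DMM fit for real ones.
    """
    # Segmentation: supervised on every sample, synthetic and real alike.
    seg_loss = F.binary_cross_entropy_with_logits(pred_seg_logits, gt_seg)

    # Geometry: only synthetic samples carry exact labels, so mask the batch.
    if is_synthetic.any():
        geo_loss = F.l1_loss(pred_geometry[is_synthetic], gt_geometry[is_synthetic])
    else:
        geo_loss = pred_geometry.new_zeros(())

    return geo_loss + seg_loss
```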
Ah, neat! Thanks for the detailed explanation.
As this uses the pix2pix pipeline, I'm curious how unlabelled real images were incorporated during training. Did you fit the synthetic dataset to some real-life face dataset like CelebA, as done by Sengupta et al. in SfSNet?
Edit: It also looks like your new model is more robust to occlusions than the previous one.