I am a beginner ML student attempting to build a small inswapper-like model. I have made some educated guesses about the components of the loss function based on my understanding (or lack thereof 🥲) and would appreciate clarification and guidance on the following aspects:
Identity Loss Component
I plan to use a pre-trained face recognition model, such as ArcFace, to extract identity embeddings from both source and output images and calculate an identity loss.
However, I am uncertain whether a (1, 512) ArcFace embedding is sufficient to capture fine details like skin texture. Additionally, I wonder if the same pre-trained face recognition model and embedding size were used for both inswapper_128 and the higher-resolution models. Could you provide insights into this?
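For what it's worth, here is the kind of identity loss I had in mind: cosine distance between the source and output embeddings. This is only my sketch (function names and numpy stand-ins are mine, not inswapper's actual recipe), with random vectors standing in for real ArcFace outputs:

```python
import numpy as np

def identity_loss(emb_source, emb_output):
    """Cosine-distance identity loss between two (1, 512) embeddings.

    emb_source: identity embedding of the source face (e.g. from ArcFace)
    emb_output: identity embedding of the generated face
    Returns 1 - cosine_similarity, so 0.0 means matching identities.
    """
    a = emb_source.ravel() / np.linalg.norm(emb_source)
    b = emb_output.ravel() / np.linalg.norm(emb_output)
    return 1.0 - float(np.dot(a, b))

# Random vectors stand in for ArcFace embeddings here:
rng = np.random.default_rng(0)
e1 = rng.normal(size=(1, 512))
e2 = rng.normal(size=(1, 512))
loss_same = identity_loss(e1, e1)  # 0.0: same identity
loss_diff = identity_loss(e1, e2)  # near 1.0 for unrelated random vectors
```

In a real training loop the embeddings would come from a frozen recognition network and the loss would be differentiated through the generator, which a plain numpy version obviously doesn't do.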
Adversarial Loss Component
It seems reasonable to employ a discriminator to distinguish between real and generated images. I would appreciate any additional details or considerations related to the adversarial loss component.
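My current guess is a standard GAN objective; for concreteness, here is a hinge-style formulation (one common choice for image GANs — I don't know whether inswapper used hinge, non-saturating BCE, or something else). `d_real`/`d_fake` would be the discriminator's raw scores:

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    # Discriminator hinge loss: push scores on real images above +1
    # and scores on generated images below -1.
    return (np.mean(np.maximum(0.0, 1.0 - d_real))
            + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def g_hinge_loss(d_fake):
    # Generator loss: raise the discriminator's score on generated images.
    return -np.mean(d_fake)
```

If the discriminator already scores a real image at +2 and a fake at -2, its hinge loss is zero, while the generator's loss on that fake is +2, so only the generator gets a gradient push.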
Expression and Head Shape
I am unsure how to preserve the expression of the destination face while retaining the head and face shape of the source. I could extract face landmarks and compute a loss over them, but using too many landmarks (68 or 98) could pull the output face shape toward the destination face. (Maybe I should just use a few face keypoints (5) instead?)
Could you share insights into how this was achieved within the loss function or other components of the inswapper model?
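To make my landmark idea concrete: instead of choosing between 5 and 68 points, one could weight the landmarks, down-weighting (or zeroing) the jawline/contour points that encode face shape while keeping full weight on eye and mouth points that encode expression. This is just my speculation, not something I know inswapper does:

```python
import numpy as np

def landmark_loss(pred, target, weights):
    """Weighted L2 loss between predicted and target 2D landmarks.

    pred, target: (N, 2) landmark arrays; weights: (N,) per-point weights.
    Zeroing contour/jawline weights keeps the loss from dragging the
    output toward the destination's face shape, while eye/mouth points
    still constrain the expression.
    """
    per_point = np.linalg.norm(pred - target, axis=1)
    return float(np.sum(weights * per_point) / np.sum(weights))
```

For example, with a 68-point layout one might set `weights[:17] = 0` (the jawline indices in the common 68-point convention) and `weights[17:] = 1`, so only the inner-face points contribute.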