Open jinwonkim93 opened 1 year ago
Hi, thank you for the wonderful work!
I am having a hard time understanding which parts of the 3DMM coefficients refer to expression, blink, and pose.
Could you give me some references for that information?
Hi. This is also my question. Did you find an answer to it?
Too complex for a hobbyist like me, but I found these passages that explain it:
Figure 3. The structure of our ExpNet. We involve a monocular 3D face reconstruction model [5] (Re and Rd) to learn the realistic expression coefficients, where Re is a pretrained 3DMM coefficients estimator and Rd is a differentiable 3D face renderer without learnable parameters. We use the reference expression β0 to reduce the uncertainty of identity, and the generated frame from pre-trained Wav2Lip [28] and the first frame as target expression coefficients, since it only contains the lip-related motions.

As shown in Figure 3, we generate the t-frame expression coefficients from an audio window a{1,...,t}, where the audio feature of each frame is a 0.2s mel-spectrogram. For training, we first design a ResNet-based audio encoder ΦA [12, 28] to embed the audio feature into a latent space. Then, a linear layer is added as the mapping network ΦM to decode the expression coefficients. Here, we also add the reference expression β0 from the reference image to support emotions and reduce the identity uncertainty as discussed above. Since we use the lip-only coefficients as ground truth in the training, we explicitly add a blinking control signal zblink ∈ [0, 1] and the corresponding eye landmark loss to generate the controllable eye blinks. Formally, the network can be written as:

β{1,...,t} = ΦM(ΦA(a{1,...,t}), zblink, β0)
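So to the original question: the coefficients come from the Deep3DFaceRecon-style estimator (the model [5] above), which predicts a single 257-dim vector per frame and slices it into named parts. Below is a rough sketch of that split, assuming the index ranges used in the Deep3DFaceRecon_pytorch repo (the exact ranges are an assumption from that codebase, not stated in this paper). "Expression" is the 64-dim slice ExpNet predicts; "pose" is the rotation angles plus translation; there is no dedicated blink coefficient — blinks are realized inside the expression coefficients via the zblink control signal described in the quote.

```python
import numpy as np

def split_coeff(coeffs: np.ndarray) -> dict:
    """Split a (B, 257) 3DMM coefficient array into its named parts.

    Index ranges follow Deep3DFaceRecon_pytorch's split_coeff
    (an assumption on my part -- check that repo to confirm).
    """
    return {
        "identity":    coeffs[:, :80],      # face shape identity (80)
        "expression":  coeffs[:, 80:144],   # expression (64) -- what ExpNet generates
        "texture":     coeffs[:, 144:224],  # albedo/texture (80)
        "angles":      coeffs[:, 224:227],  # head rotation: pitch, yaw, roll (3)
        "gammas":      coeffs[:, 227:254],  # spherical-harmonics lighting (27)
        "translation": coeffs[:, 254:257],  # head translation (3)
    }

# Example: split a dummy batch of one frame's coefficients.
coeffs = np.zeros((1, 257))
parts = split_coeff(coeffs)
for name, arr in parts.items():
    print(name, arr.shape)
```

"Pose" in the paper then corresponds to the `angles` and `translation` slices taken together, while blink control modulates the `expression` slice rather than occupying its own coefficients.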