OpenTalker / SadTalker

[CVPR 2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
https://sadtalker.github.io/

Information about the 3DMM coefficients #524

Open · jinwonkim93 opened this issue 11 months ago

jinwonkim93 commented 11 months ago

Hi, thank you for the wonderful work!

I am having a hard time understanding which parts of the 3DMM coefficients refer to expression, blink, and pose.

Could you give me some references for that information?

Tinaa23 commented 9 months ago

Hi. This is also my question. Did you find an answer to it?

G-force78 commented 3 months ago

Too complex for a hobbyist like me, but I found these that explain it:

https://openaccess.thecvf.com/content/CVPR2023/supplemental/Zhang_SadTalker_Learning_Realistic_CVPR_2023_supplemental.pdf

https://openaccess.thecvf.com/content/CVPR2023/papers/Zhang_SadTalker_Learning_Realistic_3D_Motion_Coefficients_for_Stylized_Audio-Driven_Single_CVPR_2023_paper.pdf

Figure 3. The structure of our ExpNet. We involve a monocular 3D face reconstruction model [5] (R_e and R_d) to learn the realistic expression coefficients, where R_e is a pretrained 3DMM coefficients estimator and R_d is a differentiable 3D face render without learnable parameters. We use the reference expression β_0 to reduce the uncertainty of identity, and the frames generated by pre-trained Wav2Lip [28] together with the first frame as the target expression coefficients, since they only contain the lip-related motions.

As shown in Figure 3, we generate the t-frame expression coefficients from an audio window a_{1,...,t}, where the audio feature of each frame is a 0.2s mel-spectrogram. For training, we first design a ResNet-based audio encoder Φ_A [12, 28] to embed the audio feature into a latent space. Then, a linear layer is added as the mapping network Φ_M to decode the expression coefficients. Here, we also add the reference expression β_0 from the reference image to support emotions and reduce the identity uncertainty as discussed above. Since we use the lip-only coefficients as ground truth in training, we explicitly add a blinking control signal z_blink ∈ [0, 1] and the corresponding eye landmark loss to generate controllable eye blinks. Formally, the network can be written as:

$$\beta_{\{1,\dots,t\}} = \Phi_M\big(\Phi_A(a_{\{1,\dots,t\}}),\, z_{\mathrm{blink}},\, \beta_0\big)$$
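
In case it helps with the original question, here is a minimal Python sketch of how I read the coefficient layout, assuming the split used by Deep3DFaceRecon_pytorch (the monocular reconstruction model the excerpt calls [5]). The exact indices are my own reading of that repo, not something the SadTalker authors state in this thread, so treat them as an assumption.

```python
import numpy as np

# Assumed layout of the 257-dim coefficient vector predicted per frame by
# Deep3DFaceRecon_pytorch (my reading of that repo, not an official SadTalker spec).
def split_coeff(coeffs: np.ndarray) -> dict:
    """Split an (N, 257) coefficient array into named 3DMM parts."""
    return {
        "identity":    coeffs[:, 0:80],     # face shape identity
        "expression":  coeffs[:, 80:144],   # 64-dim expression (the beta in ExpNet)
        "texture":     coeffs[:, 144:224],  # albedo / texture
        "angles":      coeffs[:, 224:227],  # head rotation (pitch, yaw, roll)
        "gammas":      coeffs[:, 227:254],  # spherical-harmonics lighting
        "translation": coeffs[:, 254:257],  # head translation
    }

if __name__ == "__main__":
    fake = np.random.randn(5, 257)            # 5 frames of made-up coefficients
    parts = split_coeff(fake)
    # SadTalker's "motion coefficients": expression + head pose (rotation + translation)
    motion = np.concatenate(
        [parts["expression"], parts["angles"], parts["translation"]], axis=1)
    print(motion.shape)                        # (5, 70)
```

Under that layout, expression would be the 80:144 slice and pose the rotation angles (224:227) plus translation (254:257). Blink is not a separate slice of the raw 3DMM vector; per the equation above, it enters as the control scalar z_blink that ExpNet folds into the generated expression coefficients.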