Hangz-nju-cuhk / Talking-Face_PC-AVS

Code for Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation (CVPR 2021)

Why embed the audio features? #58


e4s2022 commented 2 years ago

Hi, thanks for sharing this great work!

I understand the main pipeline, i.e., encoding the speech content features, identity features, and pose features separately and then feeding them to the generator to produce the driven results (a rough sketch of my understanding is below). But I am a little confused after reading the inference code. https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L473-L484
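
For reference, here is a rough sketch of the pipeline as I picture it. The module names (`audio_encoder`, `id_encoder`, `pose_encoder`, `generator`) and the concatenation-based fusion are my own placeholders and assumptions, not the exact identifiers or logic in `av_model.py`:

```python
import torch
import torch.nn as nn

class TalkingFacePipeline(nn.Module):
    """Placeholder sketch of the modularized pipeline, not the repo's code."""

    def __init__(self, audio_encoder, id_encoder, pose_encoder, generator):
        super().__init__()
        self.audio_encoder = audio_encoder  # mel-spectrogram -> speech content feature
        self.id_encoder = id_encoder        # reference frame  -> identity feature
        self.pose_encoder = pose_encoder    # pose source      -> pose feature
        self.generator = generator          # fused feature    -> driven frame

    def forward(self, mel, ref_img, pose_img):
        content = self.audio_encoder(mel)    # speech content from audio
        identity = self.id_encoder(ref_img)  # who is speaking
        pose = self.pose_encoder(pose_img)   # head pose to transfer
        # Assumed fusion: simply concatenate the modularized features
        # before decoding the talking-face frame with the generator.
        fused = torch.cat([content, pose, identity], dim=1)
        return self.generator(fused)
```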

As can be seen, the mel-spectrogram is encoded by the audio encoder first in Line 473 and is ready to be fused with the pose feature in Line 483. However, in the merge_mouthpose() function: https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L454-L461

I found that the audio features are further embedded; what is the intuition behind that? In my view, netE.mouth_embed would be used to embed the mouth features of the video, NOT the audio (see the sketch below). If anything is wrong, please correct me. Thanks in advance.
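
This is roughly how I read merge_mouthpose(): the same `mouth_embed` head appears to project features into a "mouth" space whether they come from the audio encoder or from the video mouth branch, before concatenation with the pose feature. The class below is only my hypothetical reconstruction; the names, dimensions, and the `nn.Linear` projection are assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn

class MouthPoseMerger(nn.Module):
    """Hypothetical reading of merge_mouthpose(), not the repo's code."""

    def __init__(self, feat_dim=512, mouth_dim=512):
        super().__init__()
        # One shared projection for both modalities: if audio content features
        # and visual mouth features live in the same space, the same embedding
        # can map either of them into the mouth sub-space.
        self.mouth_embed = nn.Linear(feat_dim, mouth_dim)

    def merge_mouthpose(self, content_feat, pose_feat):
        # content_feat may come from the audio encoder (mel input) or from the
        # video mouth branch; both would pass through the same mouth_embed.
        mouth = self.mouth_embed(content_feat)
        return torch.cat([mouth, pose_feat], dim=1)
```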