Hangz-nju-cuhk / Talking-Face_PC-AVS

Code for Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation (CVPR 2021)

Why embed the audio features? #58


e4s2022 commented 2 years ago

Hi, thanks for sharing this great work!

I understand the main pipeline, i.e., encoding the speech content features, identity features, and pose features separately and then feeding them to the generator to produce the driven results (a rough sketch of my understanding is below). But I am a little confused after reading the inference code. https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L473-L484
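
For reference, here is a rough sketch of the pipeline as I picture it. The module names (`audio_encoder`, `id_encoder`, `pose_encoder`, `generator`) and the concatenation-based fusion are my own placeholders and assumptions, not the exact identifiers or logic in `av_model.py`:

```python
import torch
import torch.nn as nn

class TalkingFacePipeline(nn.Module):
    """Placeholder sketch of the modularized pipeline, not the repo's code."""

    def __init__(self, audio_encoder, id_encoder, pose_encoder, generator):
        super().__init__()
        self.audio_encoder = audio_encoder  # mel-spectrogram -> speech content feature
        self.id_encoder = id_encoder        # reference frame  -> identity feature
        self.pose_encoder = pose_encoder    # pose source      -> pose feature
        self.generator = generator          # fused feature    -> driven frame

    def forward(self, mel, ref_img, pose_img):
        content = self.audio_encoder(mel)    # speech content from audio
        identity = self.id_encoder(ref_img)  # who is speaking
        pose = self.pose_encoder(pose_img)   # head pose to transfer
        # Assumed fusion: simply concatenate the modularized features
        # before decoding the talking-face frame with the generator.
        fused = torch.cat([content, pose, identity], dim=1)
        return self.generator(fused)
```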

As can be seen, the mel-spectrogram is encoded by the audio encoder first in Line 473 and is ready to be fused with the pose feature in Line 483. However, in the merge_mouthpose() function: https://github.com/Hangz-nju-cuhk/Talking-Face_PC-AVS/blob/23585e281360872fe2d1e1eec8ff49176ea0183d/models/av_model.py#L454-L461

I found that the audio features are further embedded; what is the intuition behind that? In my view, netE.mouth_embed would be used to embed the mouth features of the video, NOT the audio (see the sketch below). If anything is wrong, please correct me. Thanks in advance.
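
This is roughly how I read merge_mouthpose(): the same `mouth_embed` head appears to project features into a "mouth" space whether they come from the audio encoder or from the video mouth branch, before concatenation with the pose feature. The class below is only my hypothetical reconstruction; the names, dimensions, and the `nn.Linear` projection are assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn

class MouthPoseMerger(nn.Module):
    """Hypothetical reading of merge_mouthpose(), not the repo's code."""

    def __init__(self, feat_dim=512, mouth_dim=512):
        super().__init__()
        # One shared projection for both modalities: if audio content features
        # and visual mouth features live in the same space, the same embedding
        # can map either of them into the mouth sub-space.
        self.mouth_embed = nn.Linear(feat_dim, mouth_dim)

    def merge_mouthpose(self, content_feat, pose_feat):
        # content_feat may come from the audio encoder (mel input) or from the
        # video mouth branch; both would pass through the same mouth_embed.
        mouth = self.mouth_embed(content_feat)
        return torch.cat([mouth, pose_feat], dim=1)
```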