On every clip I train (including clips with longer silences), the inference output keeps the mouth open during silences (max mel value of roughly -3.5 across all frequency bands).
https://github.com/ZiqiaoPeng/SyncTalk/assets/2669187/ad1d3b16-79e3-4aed-9853-10d2c02ae3dd
I was thinking of detecting silences in the input audio and replacing the silent enc_auds with the features from just before the silence, where the lips are closed. Is this an architectural problem with the audio encoding, or does anybody have a resolution?
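For what it's worth, here is a minimal sketch of the workaround I had in mind. It assumes per-frame audio features aligned 1:1 with mel frames and a silence threshold around the -3.5 max-mel value mentioned above; none of the names (`patch_silent_audio_features`, `silence_thresh`) or shapes come from the SyncTalk codebase, they are just illustrative:

```python
import torch

def patch_silent_audio_features(enc_auds: torch.Tensor,
                                mel: torch.Tensor,
                                silence_thresh: float = -3.4) -> torch.Tensor:
    """Replace audio features for silent frames with the last voiced frame's features.

    Assumptions (not taken from the SyncTalk code):
      - enc_auds: per-frame audio features, shape (T, ...), aligned 1:1 with mel frames
      - mel: mel spectrogram, shape (T, n_mels); silent frames max out around -3.5
      - silence_thresh: frames whose max mel value is below this are treated as silent
    """
    silent = mel.max(dim=1).values < silence_thresh  # (T,) boolean mask

    patched = enc_auds.clone()
    last_voiced = None
    for t in range(patched.shape[0]):
        if silent[t]:
            if last_voiced is not None:
                # Reuse the features from the frame just before the silence,
                # where the mouth should still be closed.
                patched[t] = patched[last_voiced]
        else:
            last_voiced = t
    return patched
```

This only masks the symptom at inference time, of course; if the audio encoder itself maps silence to an "open mouth" embedding, a proper fix would presumably have to happen in the encoding or training stage.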