Closed wangsuzhen closed 3 years ago
In the code of the SyncNet model, the shape of `audio_sequence` is denoted as (B, dim, T). However, a 3D tensor obviously cannot be fed into a 2D CNN, so I reshape `audio_sequence` to (B, 1, dim, T). Is this correct?

I also extract the image sequence and mel sequence from real videos using your script, and compute the cosine loss with your pretrained SyncNet model. But I get similar results for in-sync and out-of-sync pairs. What is wrong here?

> So I reshape `audio_sequence` to (B, 1, dim, T). Is this correct?

Yes.

> using your script from real videos

The released SyncNet model is not trained on enough data to generalize well to arbitrary real videos. It works only on LRS2 videos, which is sufficient for training Wav2Lip on LRS2.
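For anyone hitting the same shape question: a minimal NumPy sketch of the two operations discussed above (the shapes, embedding size, and helper name are illustrative, not the repo's actual code). It shows adding a singleton channel axis so a (B, dim, T) mel chunk fits a 2D CNN input of shape (B, 1, dim, T), and scoring sync as the cosine similarity between hypothetical audio and face embeddings.

```python
import numpy as np

# Illustrative sizes: batch of 4, 80 mel bins, 16 time steps.
B, dim, T = 4, 80, 16

# Mel chunk as discussed: (B, dim, T).
audio_sequence = np.random.randn(B, dim, T)

# A 2D CNN expects a channel axis (B, C, H, W); insert C=1.
# Equivalent to torch's audio_sequence.unsqueeze(1).
audio_input = audio_sequence[:, None, :, :]
print(audio_input.shape)  # (4, 1, 80, 16)

# Hypothetical 512-d embeddings from the audio and face encoders.
audio_emb = np.random.randn(B, 512)
face_emb = np.random.randn(B, 512)

def cosine_sim(a, v):
    """Per-sample cosine similarity; higher means more likely in sync."""
    return np.sum(a * v, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(v, axis=1)
    )

score = cosine_sim(audio_emb, face_emb)  # shape (B,), values in [-1, 1]
```

Note that similar scores for in-sync and out-of-sync pairs, as reported above, point to the model (domain mismatch) rather than to this scoring step.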