Closed wangsuzhen closed 3 years ago
In the code of the SyncNet model, the shape of `audio_sequence` is denoted as (B, dim, T). However, a 3D tensor obviously cannot be fed into a 2D CNN, so I reshape `audio_sequence` to (B, 1, dim, T). Is this correct?

I also extract the image sequence and mel sequence from real videos using your script, and compute the cosine loss with your pretrained SyncNet model. But I get similar results for in-sync and out-of-sync pairs. What is wrong here?

> So I reshape `audio_sequence` to (B, 1, dim, T). Is this correct?

Yes.

> using your script from real videos

The released SyncNet model is not trained on enough data to generalize well to arbitrary real videos. It works only on LRS2 videos, which is sufficient for training Wav2Lip on LRS2.
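For anyone hitting the same shape question: a minimal NumPy sketch of the two operations discussed above (the shapes, embedding size, and helper name are illustrative, not the repo's actual code). It shows adding a singleton channel axis so a (B, dim, T) mel chunk fits a 2D CNN input of shape (B, 1, dim, T), and scoring sync as the cosine similarity between hypothetical audio and face embeddings.

```python
import numpy as np

# Illustrative sizes: batch of 4, 80 mel bins, 16 time steps.
B, dim, T = 4, 80, 16

# Mel chunk as discussed: (B, dim, T).
audio_sequence = np.random.randn(B, dim, T)

# A 2D CNN expects a channel axis (B, C, H, W); insert C=1.
# Equivalent to torch's audio_sequence.unsqueeze(1).
audio_input = audio_sequence[:, None, :, :]
print(audio_input.shape)  # (4, 1, 80, 16)

# Hypothetical 512-d embeddings from the audio and face encoders.
audio_emb = np.random.randn(B, 512)
face_emb = np.random.randn(B, 512)

def cosine_sim(a, v):
    """Per-sample cosine similarity; higher means more likely in sync."""
    return np.sum(a * v, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(v, axis=1)
    )

score = cosine_sim(audio_emb, face_emb)  # shape (B,), values in [-1, 1]
```

Note that similar scores for in-sync and out-of-sync pairs, as reported above, point to the model (domain mismatch) rather than to this scoring step.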