YinonDouchanClarity opened 1 year ago
Have you figured it out, bro?
I haven't checked it, but I think I know what causes it: the videos must be sync-corrected, meaning the model can't tolerate an audio-video offset of even a frame or two. In the original SyncNet article, they find the AV offset by sliding a window over the frame sequence, feeding each audio-video pair at a given offset into SyncNet, and checking how well the audio matches the video at that offset. Then, by averaging the results over all frames, they find the offset with the minimum distance/maximum similarity. You can read more about it in the original SyncNet paper: https://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/chung16a.pdf
Therefore, inputting the audio and video chunks directly assumes that the AV offset is 0, which might give poor results. The authors of Wav2Lip didn't implement the aforementioned AV-offset estimation method, since they only used SyncNet as an additional loss for training the generator. It should take a few dozen lines of code to implement.
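Here is a rough sketch of what that offset search could look like. This is not code from the repo; it assumes a SyncNet-style model whose forward pass takes a mel chunk and a short window of face frames and returns (audio_embedding, video_embedding), as in Wav2Lip's SyncNet_color. The function name, window/chunk lists and `max_offset` parameter are all illustrative:

```python
import torch
import torch.nn.functional as F

def estimate_av_offset(syncnet, video_windows, mel_chunks, max_offset=15):
    """video_windows[i] and mel_chunks[i] are nominally aligned.
    Try shifting the audio by -max_offset..+max_offset frames and
    return the shift with the highest average cosine similarity."""
    best_offset, best_score = 0, -1.0
    n = min(len(video_windows), len(mel_chunks))
    for offset in range(-max_offset, max_offset + 1):
        sims = []
        for i in range(n):
            j = i + offset
            if j < 0 or j >= n:
                continue  # shifted audio chunk falls outside the clip
            with torch.no_grad():
                a, v = syncnet(mel_chunks[j], video_windows[i])
            sims.append(F.cosine_similarity(a, v).mean().item())
        if sims:
            score = sum(sims) / len(sims)
            if score > best_score:
                best_score, best_offset = score, offset
    return best_offset, best_score
```

You could then shift the audio (or the frame indices) by the returned offset before feeding the data to the generator, which is essentially what the SyncNet paper does when sync-correcting videos.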
Hey bro, that's exactly what I'm looking for. Have you resolved the issue of the delay between audio and video? Can you share your code?
@YinonDouchanClarity @rainbowoldhorse did you manage to solve it somehow?
I tried using the pre-trained SyncNet model on a subset of the AVSpeech dataset. The videos I chose from AVSpeech were 25 FPS. In color_syncnet_train.py, I turned off backpropagation so I could see the loss the pre-trained model got on those videos, for both in-sync and out-of-sync pairs. When I did so, the loss I got was somewhere between 6 and 7.
I know this model was pre-trained on LRS2, not AVSpeech. Should it be able to generalize to datasets other than LRS2? Has anyone tried using the pre-trained SyncNet for inference and gotten positive results?
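For reference, this is roughly what my evaluation looked like (a minimal sketch, not the exact script): the loss is the same cosine-similarity BCE used in color_syncnet_train.py, and I'm assuming a `test_loader` that yields (frames, mel, y) batches like the Dataset in that script, with y=1 for in-sync pairs and y=0 for out-of-sync pairs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logloss = nn.BCELoss()

def cosine_loss(a, v, y):
    # Same formulation as color_syncnet_train.py: BCE on the cosine similarity
    d = F.cosine_similarity(a, v)
    return logloss(d.unsqueeze(1), y)

def evaluate(syncnet, test_loader, device="cuda"):
    syncnet.eval()
    losses = []
    with torch.no_grad():  # no backprop, just measure the loss
        for frames, mel, y in test_loader:
            frames, mel, y = frames.to(device), mel.to(device), y.to(device)
            a, v = syncnet(mel, frames)
            losses.append(cosine_loss(a, v, y).item())
    return sum(losses) / len(losses)
```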