Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs: https://synclabs.so

Poor performance of pretrained SyncNet (lipsync_expert.pth) on AVSpeech videos #500

YinonDouchanClarity commented 1 year ago

I tried using the pre-trained SyncNet model on a subset of the AVSpeech dataset, choosing only 25 FPS videos. In color_syncnet_train.py, I turned off backpropagation so I could measure the loss the pre-trained model incurs on those videos, with either in-sync or out-of-sync pairs. When I did so, the loss I got was somewhere between 6 and 7.
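For reference, my evaluation-only loop looked roughly like this (a sketch adapted from color_syncnet_train.py; `avspeech_loader` is a placeholder for your own DataLoader over the AVSpeech clips):

```python
import torch
import torch.nn as nn

from models import SyncNet_color as SyncNet  # same model class color_syncnet_train.py uses

logloss = nn.BCELoss()

def cosine_loss(a, v, y):
    # The same objective color_syncnet_train.py optimizes: cosine similarity
    # between audio/video embeddings, BCE against the 0/1 sync label.
    d = nn.functional.cosine_similarity(a, v)
    return logloss(d.unsqueeze(1), y)

@torch.no_grad()  # no backprop: evaluation only
def evaluate(model, loader, device):
    model.eval()
    losses = []
    for x, mel, y in loader:  # x: stacked face frames, mel: mel chunk, y: 1 (in-sync) / 0 (off-sync)
        a, v = model(mel.to(device), x.to(device))
        losses.append(cosine_loss(a, v, y.to(device)).item())
    return sum(losses) / len(losses)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SyncNet().to(device)
s = torch.load("lipsync_expert.pth", map_location=device)["state_dict"]
model.load_state_dict({k.replace("module.", ""): v for k, v in s.items()})  # strip DataParallel prefix
# print(evaluate(model, avspeech_loader, device))
```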

I know that this model was pre-trained on LRS2 and not AVSpeech. Should this model be able to generalize to datasets other than LRS2? Has anyone tried using the pre-trained SyncNet for inference and gotten positive results?

zhanglonghao1992 commented 1 year ago

Have you figured it out, bro?

YinonDouchanClarity commented 1 year ago

I haven't checked it, but I think I know what causes it: the videos must be sync-corrected first, because the model can't tolerate an audio-video offset of even a frame or two. In the original SyncNet article, they find the AV offset by sliding the audio relative to the video frame sequence, feeding each candidate audio-video pair to SyncNet, and scoring how in-sync the audio looks with the video at that offset. Then, by averaging the results over all frames, they pick the offset with the minimum distance/maximum similarity. You can read more about it in the original SyncNet paper: https://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/chung16a.pdf

Therefore, feeding in the audio and video chunks directly assumes the AV offset is 0, and this can give poor results. The authors of Wav2Lip didn't implement the aforementioned AV-offset estimation, since they only used SyncNet as an additional loss for training the generator. Implementing it should take a few dozen lines of code.
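If anyone wants to try it, here is a rough sketch of that offset search (untested; it assumes the pre-trained expert model as used in color_syncnet_train.py, that you have already built the 5-frame video windows and their aligned mel chunks with a stride of one frame, and that `max_offset` bounds the search range):

```python
import torch
import torch.nn.functional as F

def estimate_av_offset(model, video_windows, mel_windows, max_offset=15):
    """SyncNet-style offset search (sketch): shift the audio against the video
    and keep the offset whose window pairs look most in-sync on average.

    video_windows: (N, ...) stacked 5-frame face windows, built with stride 1 frame
    mel_windows:   (N, ...) the mel-spectrogram chunk aligned to each window
    Returns the audio offset, in frames, relative to the video.
    """
    model.eval()
    with torch.no_grad():
        a, v = model(mel_windows, video_windows)  # (N, D) audio / video embeddings

    best_offset, best_sim = 0, -float("inf")
    for offset in range(-max_offset, max_offset + 1):
        # Pair audio window i+offset with video window i, dropping the non-overlapping ends.
        if offset >= 0:
            av, vv = a[offset:], v[:len(v) - offset]
        else:
            av, vv = a[:len(a) + offset], v[-offset:]
        if len(av) == 0:
            continue
        # The Wav2Lip expert was trained on cosine similarity, so maximize its mean;
        # the original SyncNet paper equivalently minimizes a mean Euclidean distance.
        sim = F.cosine_similarity(av, vv).mean().item()
        if sim > best_sim:
            best_offset, best_sim = offset, sim
    return best_offset
```

You would then shift the audio by the estimated offset before feeding the pair to the expert (at 25 FPS, one frame is 40 ms).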

rainbowoldhorse commented 1 year ago

Hey bro, that's exactly what I'm looking for. Have you resolved the issue of delay between audio and video? Can you share your code?

lbdave94 commented 5 months ago

@YinonDouchanClarity @rainbowoldhorse did you manage to solve this somehow?