YinonDouchanClarity opened 1 year ago
Have you figured it out, bro?
I haven't checked it, but I think I know what causes it: the videos must be sync-corrected, meaning the model can't tolerate an audio-video offset of even a frame or two. In the original SyncNet article, they find the AV offset by sliding a window over the frame sequence, feeding each audio-video pair at a given offset into SyncNet, and checking how well the audio matches the video at that offset. Then, by averaging the results over all frames, they find the offset with the minimum distance/maximum similarity. You can read more about it in the original SyncNet paper: https://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/chung16a.pdf
Therefore, inputting the audio and video chunks directly assumes that the AV offset is 0, which might give poor results. The authors of Wav2Lip didn't implement the aforementioned AV-offset estimation method, since they only used SyncNet as an additional loss for training the generator. It should take a few dozen lines of code to implement.
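Here is a rough sketch of what that offset search could look like. This is not code from the repo; it assumes a SyncNet-style model whose forward pass takes a mel chunk and a short window of face frames and returns (audio_embedding, video_embedding), as in Wav2Lip's SyncNet_color. The function name, window/chunk lists and `max_offset` parameter are all illustrative:

```python
import torch
import torch.nn.functional as F

def estimate_av_offset(syncnet, video_windows, mel_chunks, max_offset=15):
    """video_windows[i] and mel_chunks[i] are nominally aligned.
    Try shifting the audio by -max_offset..+max_offset frames and
    return the shift with the highest average cosine similarity."""
    best_offset, best_score = 0, -1.0
    n = min(len(video_windows), len(mel_chunks))
    for offset in range(-max_offset, max_offset + 1):
        sims = []
        for i in range(n):
            j = i + offset
            if j < 0 or j >= n:
                continue  # shifted audio chunk falls outside the clip
            with torch.no_grad():
                a, v = syncnet(mel_chunks[j], video_windows[i])
            sims.append(F.cosine_similarity(a, v).mean().item())
        if sims:
            score = sum(sims) / len(sims)
            if score > best_score:
                best_score, best_offset = score, offset
    return best_offset, best_score
```

You could then shift the audio (or the frame indices) by the returned offset before feeding the data to the generator, which is essentially what the SyncNet paper does when sync-correcting videos.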
Hey bro, that's exactly what I'm looking for. Have you resolved the issue of the delay between audio and video? Can you share your code?
@YinonDouchanClarity @rainbowoldhorse did you manage to solve it somehow?
I tried using the pre-trained SyncNet model on a subset of the AVSpeech dataset. The videos I chose from AVSpeech were 25 FPS. In color_syncnet_train.py, I turned off backpropagation so I could see the loss the pre-trained model got on those videos, for both in-sync and out-of-sync pairs. When I did so, the loss I got was somewhere between 6 and 7.
I know this model was pre-trained on LRS2, not AVSpeech. Should it be able to generalize to datasets other than LRS2? Has anyone tried using the pre-trained SyncNet for inference and gotten positive results?
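For reference, this is roughly what my evaluation looked like (a minimal sketch, not the exact script): the loss is the same cosine-similarity BCE used in color_syncnet_train.py, and I'm assuming a `test_loader` that yields (frames, mel, y) batches like the Dataset in that script, with y=1 for in-sync pairs and y=0 for out-of-sync pairs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logloss = nn.BCELoss()

def cosine_loss(a, v, y):
    # Same formulation as color_syncnet_train.py: BCE on the cosine similarity
    d = F.cosine_similarity(a, v)
    return logloss(d.unsqueeze(1), y)

def evaluate(syncnet, test_loader, device="cuda"):
    syncnet.eval()
    losses = []
    with torch.no_grad():  # no backprop, just measure the loss
        for frames, mel, y in test_loader:
            frames, mel, y = frames.to(device), mel.to(device), y.to(device)
            a, v = syncnet(mel, frames)
            losses.append(cosine_loss(a, v, y).item())
    return sum(losses) / len(losses)
```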