Really Bad Lip Sync Results For Use Case 1

abm505 commented 4 years ago

Hi,

Thank you for sharing the model and it's great to see the progress made overall.

When testing, I've observed that the results are roughly as expected for use case 2 (generating lip movements for a still picture), but they are really bad when generating correct lip motion on an arbitrary talking-face video, i.e. Use Case 1 as described in the repository. I am comparing against the results shown on GitHub and discussed in the paper.

In the results generated for Use Case 1, the lip motion seems almost the same as in the source video. I'm trying to understand whether that is the expected behavior. It appears as if there is nearly no lip-sync at all - the lip movements are almost like those in the source video and not much indicative of words being played in the input audio. If input audio has a pause, the lip movements keep happening if the source video had lip movements.

I'm sharing some examples of the results to give a better idea of the model's capabilities. These results were generated using a sample video of Obama and a separate audio sample of Obama.

Here is another example with a different video of Obama and another sample audio of Obama:

To generate the results, I had someone create a Colab notebook, and they used the Librosa approach. I can share the notebook if you want to check whether a mistake was made.

Again, I really appreciate the model and feel it represents a great advance in the overall tech. I'm creating this issue only to find out whether this quality of results is what should be expected from the model or whether it can be improved. Thank you.

abm505 commented 4 years ago

To save the researchers some time and to give more detail, here is the Colab notebook used for inference:

abm505 commented 4 years ago

Closing the issue since there have been no comments from the researchers. I have removed the links.

aretius commented 4 years ago

@abm505 I have faced similar issues when running it on a video. Any suggestions?

prajwalkr commented 4 years ago

One of the main limitations of the model is that it is inaccurate for some videos in the wild, especially during silences. This is a known issue in existing talking face models and is being addressed in future work. We will update this issue when the new work is released.

prajwalkr commented 3 years ago

> It appears as if there is nearly no lip-sync at all - the lip movements are almost like those in the source video and not much indicative of words being played in the input audio. If input audio has a pause, the lip movements keep happening if the source video had lip movements.

Please switch to this latest improved work: https://github.com/Rudrabha/Wav2Lip
