Rudrabha / Lip2Wav

This is the repository containing the code for our CVPR 2020 paper titled "Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis"

The pre-trained model does not seem to work well #21

Open enhuiz opened 3 years ago

enhuiz commented 3 years ago

I tried to use the pre-trained model to generate audio on the test set of the DL speaker, and got the following results:

dl_test_results.zip

The results do not sound very good. I got the following scores:

Mean PESQ: 1.2502756974913858 
Mean STOI: 0.051719609840522554
Mean ESTOI: 0.011818173468155018

which seem much poorer than the values reported in the paper.

I used this pre-trained model with dl.json as the config. What could the problem be?
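For reference, a minimal sketch of how such scores can be computed with the `pesq` and `pystoi` PyPI packages, assuming 16 kHz mono WAVs in paired directories (the directory layout and exact evaluation script used here are assumptions):

```python
# Sketch: mean PESQ/STOI/ESTOI over paired ground-truth and generated WAVs.
# Assumes `pip install pesq pystoi librosa` and 16 kHz mono audio; the
# gts/ and generated/ layout is hypothetical.
import glob
import librosa
import numpy as np
from pesq import pesq
from pystoi import stoi

SR = 16000
scores = {"PESQ": [], "STOI": [], "ESTOI": []}

for gt_path in sorted(glob.glob("gts/*.wav")):
    gen_path = gt_path.replace("gts/", "generated/")
    ref, _ = librosa.load(gt_path, sr=SR)
    deg, _ = librosa.load(gen_path, sr=SR)
    n = min(len(ref), len(deg))          # align lengths before scoring
    ref, deg = ref[:n], deg[:n]
    scores["PESQ"].append(pesq(SR, ref, deg, "wb"))           # wide-band PESQ
    scores["STOI"].append(stoi(ref, deg, SR, extended=False))
    scores["ESTOI"].append(stoi(ref, deg, SR, extended=True))

for name, vals in scores.items():
    print(f"Mean {name}: {np.mean(vals)}")
```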

joannahong commented 3 years ago

Same problem here. Has anyone managed to solve it?

Rudrabha commented 3 years ago

Can you check with ffmpeg version 2.8.15? If it does not work with that version, can you add "dl" to this if statement while preprocessing the data? I hope this resolves the issue.
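(A quick way to confirm which ffmpeg the preprocessing script will actually pick up, using only standard flags:)

```sh
# Print the version of the ffmpeg binary currently on PATH
ffmpeg -version | head -n 1
```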

enhuiz commented 3 years ago

I have rerun the test sets of the different speakers with ffmpeg 2.8.15 using template2 and got the following results. (I think it is probably not an audio issue, as the audio sampling rate is 16 kHz and the quality sounds good.)

chem

Calculating for 126 files

Mean PESQ: 1.2873870389802116
Mean STOI: 0.3381292155712502
Mean ESTOI: 0.2240446433485364

chess

Calculating for 106 files

Mean PESQ: 1.3580481702426694
Mean STOI: 0.3045034811738211
Mean ESTOI: 0.2072817270788053

dl

Calculating for 98 files

Mean PESQ: 1.2668071668975207
Mean STOI: 0.0557337203676023
Mean ESTOI: 0.016091822381584294

hs

Calculating for 86 files

Mean PESQ: 1.2996865982233092
Mean STOI: 0.4099540455183349
Mean ESTOI: 0.26221282797397855

eh

Calculating for 100 files

Mean PESQ: 1.3748773157596588
Mean STOI: 0.4417572027973521
Mean ESTOI: 0.2446916955337558

eh is better; chem, chess, and hs are similar or slightly worse; dl is much worse.

I notice your script uses youtube-dl -f best to download the videos:

https://github.com/Rudrabha/Lip2Wav/blob/95d923f130e4cce93fdffbf67d2ec2d66eef933a/download_speaker.sh#L5

which does not always give the highest video resolution. For example:

youtube-dl -f best 9TFnjJkfqmA

downloads the video at 640x360 instead of 1280x720, leaving the frames at a lower resolution. Do you always use the highest resolution for testing, or the "best" given by youtube-dl? A higher resolution should give better results.
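(Background: `-f best` selects the best single file containing both audio and video, which YouTube often caps well below the top resolution. youtube-dl can list the available formats and merge the separate best streams instead; merging requires ffmpeg on PATH:)

```sh
# List all formats youtube-dl sees for the example video
youtube-dl -F 9TFnjJkfqmA

# Download the best video and audio streams separately and merge them,
# falling back to "best" if merging is not possible
youtube-dl -f "bestvideo+bestaudio/best" 9TFnjJkfqmA
```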

The other thing is that I'm using a different S3FD implementation for face detection, which may cause inconsistency. Since face detection is time-consuming and a likely source of discrepancies, could the authors share the detection results (i.e. the face bounding boxes) for at least the test set?

enhuiz commented 3 years ago

You can also find my face bounding boxes here: detection.zip. They can be used with this script to crop the intervals into JPEGs without running face detection again.
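(A minimal sketch of that kind of cropping, assuming the boxes are stored per-frame as (x1, y1, x2, y2) tuples in a pickle file; the actual layout of detection.zip may differ:)

```python
# Sketch: crop video frames to pre-computed face boxes and dump JPEGs,
# so no face detector needs to run. File names and box format are assumptions.
import os
import pickle
import cv2

video_path = "video.mp4"
boxes_path = "video_boxes.pkl"   # hypothetical per-video box file
out_dir = "faces"
os.makedirs(out_dir, exist_ok=True)

with open(boxes_path, "rb") as f:
    boxes = pickle.load(f)       # list of (x1, y1, x2, y2), one per frame

cap = cv2.VideoCapture(video_path)
for i, (x1, y1, x2, y2) in enumerate(boxes):
    ok, frame = cap.read()
    if not ok:
        break
    face = frame[int(y1):int(y2), int(x1):int(x2)]
    cv2.imwrite(os.path.join(out_dir, f"{i}.jpg"), face)
cap.release()
```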

kangzhiq commented 3 years ago

@Rudrabha @prajwalkr Thanks for this great work! However, I am experiencing the same performance degradation while testing on video clips from the LRW dataset; more precisely, I used the implementation in the Multispeaker branch. The obtained scores are as follows:

Mean PESQ: 1.156
Mean STOI: 0.112
Mean ESTOI: 0.050

which are significantly lower than the values reported in the paper. I am wondering if you can provide insights into how to obtain the same performance.

FYI, I am following the exact configuration explained in the README.

Thank you in advance!

Rudrabha commented 3 years ago

We are not sure why this issue is occurring. Please check the quality of the generated samples. We have cross-checked for a few speakers in our system and found it to be working as expected.

kangzhiq commented 3 years ago

> We are not sure why this issue is occurring. Please check the quality of the generated samples. We have cross-checked for a few speakers in our system and found it to be working as expected.

@Rudrabha Thank you for your reply! Do you mean the scores reported in the paper were evaluated on a subset of speakers instead of the entire dataset?

prajwalkr commented 3 years ago

No, it was reported on the entire test data. What @Rudrabha means is that you can listen to a few generated samples and see if they sound alright. Usually, it is an issue of a different FFmpeg version, wrong preprocessing, etc.

kangzhiq commented 3 years ago

> No, it was reported on the entire test data. What @Rudrabha means is that you can listen to a few generated samples and see if they sound alright. Usually, it is an issue of a different FFmpeg version, wrong preprocessing, etc.

@prajwalkr Thank you for your reply! Indeed, I confirm that after downgrading ffmpeg to 2.8.15, the scores are now similar to those in the paper. It may be worth mentioning this in the Prerequisites section of the README. Great work!

VirajBagal commented 3 years ago

@kangzhiq How do I install ffmpeg 2.8.15? When I use `sudo apt-get install ffmpeg==2.8.15` I get this error: `E: Version '2.8.15' for 'ffmpeg' was not found`. Please help.
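(apt can only install versions that exist in the distribution's repositories, which is why the command above fails; Ubuntu's repositories do not carry 2.8.15. One approach that generally works is building it from the official source tarball; the install prefix below is just one reasonable choice:)

```sh
# Build ffmpeg 2.8.15 from source into a user-local prefix.
# You may need yasm installed for the x86 assembly, or pass --disable-yasm.
wget https://ffmpeg.org/releases/ffmpeg-2.8.15.tar.bz2
tar xjf ffmpeg-2.8.15.tar.bz2
cd ffmpeg-2.8.15
./configure --prefix="$HOME/ffmpeg-2.8.15"
make -j"$(nproc)"
make install

# Put the new binary first on PATH for the current shell
export PATH="$HOME/ffmpeg-2.8.15/bin:$PATH"
ffmpeg -version | head -n 1   # should now report 2.8.15
```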