Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs:
https://synclabs.so

Has anyone tried training with the high-quality AVSpeech dataset? #167

Closed · jays0606 closed this 3 years ago

jays0606 commented 3 years ago

I have tried to do so after resizing the images to 192×192, but the sync loss does not fall below 0.7, so the final result isn't very good.

Has anyone succeeded in training with high-qual videos?

I'm not sure what the problem is.

rebotnix commented 3 years ago

Did you just resize the images, or did you also change the layers inside the network? Resizing the images alone will not help.

jays0606 commented 3 years ago

I used the AVSpeech data, resized the face detection results to 192×192, added some layers to the model, and began training. Any more suggestions? Thank you.
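For reference, a minimal sketch (my illustration, not jays0606's actual code) of the kind of extra stage that can be prepended to the face encoder so a 192×192 crop is reduced to the 96×96 resolution the stock blocks expect. The channel counts are assumptions, and a matching upsampling stage would be needed at the end of the decoder:

```python
import torch
import torch.nn as nn

# Hypothetical extra encoder stage for 192x192 input: one stride-2 conv halves
# the spatial size to 96x96, after which the original 96x96 encoder blocks apply.
extra_stage = nn.Sequential(
    nn.Conv2d(6, 16, kernel_size=7, stride=1, padding=3),  # 6 ch = masked window + reference frames
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 6, kernel_size=3, stride=2, padding=1),  # 192 -> 96
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 6, 192, 192)    # dummy concatenated input window
print(extra_stage(x).shape)        # torch.Size([1, 6, 96, 96])
```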

CHUUUU commented 3 years ago

I resize the input image to 512×512 (using the Wav2Lip face_detection and resizing), the generator outputs a 512×512 image, and I resize it to 96×96 for the sync loss, but the sync loss mostly stays between 1 and 2.
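A minimal sketch of the crop-and-resize step described above, for reference. The lower-half crop and the 48×96 SyncNet input size follow the stock 96×96 pipeline as I understand it; the tensor layout and function name are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def to_syncnet_input(g):
    # g: generated frames, shape (B, 3, T, H, W); layout and names are illustrative.
    B, C, T, H, W = g.shape
    g = g[:, :, :, H // 2:]                                    # lower half = mouth region
    g = g.permute(0, 2, 1, 3, 4).reshape(B * T, C, H // 2, W)  # one image per row
    g = F.interpolate(g, size=(48, 96), mode='bilinear',
                      align_corners=False)                     # 96x96 SyncNet expects 48x96 crops
    return g.reshape(B, T * C, 48, 96)                         # T frames stacked on channels

g = torch.randn(2, 3, 5, 512, 512)                             # dummy 512x512 generator output
print(to_syncnet_input(g).shape)                               # torch.Size([2, 15, 48, 96])
```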

jays0606 commented 3 years ago

> I resize the input image to 512×512 (using the Wav2Lip face_detection and resizing), the generator outputs a 512×512 image, and I resize it to 96×96 for the sync loss, but the sync loss mostly stays between 1 and 2.

How was your final output?

CHUUUU commented 3 years ago

The lips don't follow the sound but follow the image, because the decoder gets almost no signal from the sync loss.

I also tried feeding the ground-truth images and ground-truth mel to SyncNet, and the sync loss was still between 1 and 2, so I think the SyncNet pretraining has a problem.
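That ground-truth sanity check is a good one. As far as I can tell, the repo's sync loss is binary cross-entropy on the cosine similarity of audio and video embeddings, so it can be reproduced roughly like this (`syncnet`, `gt_mel`, and `gt_frames` would be your model and data; random non-negative stand-in embeddings are used here so the snippet runs on its own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logloss = nn.BCELoss()

def cosine_loss(a, v, y):
    # BCE on the cosine similarity of audio (a) and video (v) embeddings; the
    # embeddings come out of ReLU layers in SyncNet, so the similarity is in [0, 1].
    d = F.cosine_similarity(a, v)
    return logloss(d.unsqueeze(1), y)

# With a real pretrained SyncNet you would use: a, v = syncnet(gt_mel, gt_frames)
# on in-sync pairs and expect a loss well below chance (~0.69).
a = torch.rand(8, 512)   # stand-in audio embeddings
v = torch.rand(8, 512)   # stand-in video embeddings
y = torch.ones(8, 1)     # label: "in sync"
print(cosine_loss(a, v, y).item())
```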

I also tried pretraining SyncNet on my own 30-minute dataset (one person); the loss dropped from 0.75 to 0.69 and doesn't fall any further.
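One observation about that 0.69 plateau: with a binary cross-entropy loss, a model whose predictions carry no information sits at -ln(0.5) ≈ 0.693, so a sync loss stuck near 0.69 means SyncNet is still at chance level:

```python
import math

# BCE for a classifier that always predicts 0.5 (pure chance):
print(-math.log(0.5))  # 0.6931... — a sync loss flat near 0.69 means nothing has been learned
```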

jinny960812 commented 3 years ago

I tried with my own dataset, but the result either syncs well with lots of artifacts, or doesn't sync but looks clean and realistic. I looked at the loss graph, and it actually drops and rises abruptly... Does anyone know what the problem is?

rebotnix commented 3 years ago

The loss must be < 0.25 to get good results. A loss of 1-2 is not usable at all, and it suggests there is an issue in your dataset; try splitting it into more, smaller clips. The other thing is that you have to redesign the whole network, because this project uses an input size of only 96 px: you need to add more layers so it accepts the higher input size, for your dataset as well as for inference. After that you have to re-train SyncNet first, then Wav2Lip. Hope that can help you guys.
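On the "smaller clips" point, one common approach (not specific to this repo) is ffmpeg's segment muxer, e.g. cutting long videos into roughly 5-second pieces:

```sh
# Split a long video into ~5 s chunks without re-encoding (cuts land on keyframes)
ffmpeg -i input.mp4 -c copy -f segment -segment_time 5 -reset_timestamps 1 clip_%04d.mp4
```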