Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs:
https://synclabs.so

Lip-sync expert model loss has not declined when training on my custom dataset #419

Open mrlihellohorld opened 2 years ago

mrlihellohorld commented 2 years ago

Thank you for open-sourcing such a great project. I carefully followed your training method at https://github.com/Rudrabha/Wav2Lip#training-on-datasets-other-than-lrs2. First, I trained the expert discriminator on my own dataset before training Wav2Lip, but the loss has not declined [loss curve screenshot]. The total length of my dataset is 70 minutes, divided into one-minute videos. A video sample is here: https://user-images.githubusercontent.com/31719207/192180675-e988ccc8-2ee8-4fd3-9950-9b9584364aee.mp4. Could you give me some advice?
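
For context on what that loss is: the expert discriminator in this repo is trained with binary cross-entropy over the cosine similarity of the audio and face embeddings (see color_syncnet_train.py). A minimal self-contained sketch; the tensor shapes and the clamp are my own assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the expert discriminator's loss from color_syncnet_train.py:
# BCE over the cosine similarity of audio and face embeddings.
logloss = nn.BCELoss()

def cosine_loss(audio_emb, face_emb, y):
    # In the repo, ReLU-activated encoders keep the similarity in [0, 1];
    # the clamp only guards this standalone sketch against negative values.
    d = F.cosine_similarity(audio_emb, face_emb).clamp(1e-7, 1 - 1e-7)
    return logloss(d.unsqueeze(1), y)

a = F.normalize(torch.rand(8, 512), dim=1)  # stand-in audio embeddings
v = F.normalize(torch.rand(8, 512), dim=1)  # stand-in face embeddings
y = torch.randint(0, 2, (8, 1)).float()     # 1 = in-sync pair, 0 = off-sync
print(cosine_loss(a, v, y).item())          # loss on these random stand-ins
```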

Chinenana commented 1 year ago

I have encountered the same problem... how do you get the cosine loss to decrease?

NikitaKononov commented 1 year ago

You need much more data: tens or hundreds of videos.

Chinenana commented 1 year ago

Thank you dear, I'll have a try! Thanks for your attention again 🥰


yagcaglar commented 1 year ago

Hi, I am planning to train the model with a different dataset that has scenes similar to the video above. I am concerned that the background will affect accuracy; how were your results? Do you have any preprocessing recommendations for cropping the video and focusing on the lip area (to handle head movement), as in the LRS2 dataset?

NikitaKononov commented 1 year ago

> Hi, I am planning to train the model with a different dataset that has scenes similar to the video above. I am concerned that the background will affect accuracy; how were your results? Do you have any preprocessing recommendations for cropping the video and focusing on the lip area (to handle head movement), as in the LRS2 dataset?

Training this implementation is hopeless: 96x96 resolution will lead to poor quality in all cases. How many video samples do you have? To train a strong SyncNet (288x288, for example) you'll need hundreds of thousands of video clips.

yagcaglar commented 1 year ago

Thanks for the quick response. I am planning to use https://github.com/deeplsd/Merkel-Podcast-Corpus; it has ~28 hours of video, but the videos are not cropped to the face itself, so I have to modify them by hand.

NikitaKononov commented 1 year ago

> Thanks for the quick response. I am planning to use https://github.com/deeplsd/Merkel-Podcast-Corpus; it has ~28 hours of video, but the videos are not cropped to the face itself, so I have to modify them by hand.

If you train the model on a single person, it won't have the ability to generalize and will behave poorly on other videos.

Such a dataset can be good if you want to train an end-to-end talking-head generation model.

tanshuai0219 commented 1 year ago

Hey, have you solved the problem? I have encountered the same problem: I trained on the LRW dataset and the loss dropped below 0.2 on the training set, but it stays at 0.68-0.69 on the test set. Could you give me some advice?
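
(A note on those numbers: a binary classifier that always outputs 0.5 has a BCE loss of -ln 0.5 ≈ 0.693, so a test loss pinned at 0.68-0.69 while the train loss keeps falling means the discriminator is at chance on held-out clips, i.e. it has memorized the training pairs rather than learned sync.)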

wllps1988315 commented 1 year ago

> https://github.com/deeplsd/Merkel-Podcast-Corpus

Hi, may I ask you some questions about the LRS2 dataset? I want to train on datasets other than LRS2; how could I do that?

1. What do NF and MV mean in test.txt? (`6330311066473698535/00011 NF`, `6330311066473698535/00018 MV`)
2. Does "Conf" mean confidence in 00001.txt? (`Text: WHEN YOU'RE COOKING CHIPS AT HOME`, `Conf: 4`)

1105135335 commented 1 year ago

> cropped for the face

Do I have to crop the video to the face first?

yagcaglar commented 1 year ago

First I tried training without cropping, but the loss did not drop as expected (to around 0.2) because the model tries to generate the lower half of the whole frame (see issue #375 for an example input image). Also, as @NikitaKononov mentioned, the effective face resolution is reduced, so it's better to crop as tightly as you can. With cropped images I was able to get the loss down to around 0.03.
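
For anyone wanting to automate that cropping, here is a minimal sketch using OpenCV's bundled Haar face detector; the detector choice, margin, and output size are my assumptions (the repo's own preprocess.py uses an S3FD-based detector instead):

```python
import cv2

# Minimal face-cropping sketch with OpenCV's Haar cascade (illustrative only;
# Wav2Lip's preprocess.py uses an S3FD detector instead).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_faces(video_path, out_path, margin=0.2):
    cap = cv2.VideoCapture(video_path)
    writer = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, 1.1, 5)
        if len(faces) == 0:
            continue  # skip frames with no detection
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
        m = int(margin * max(w, h))                         # padding around box
        x0, y0 = max(x - m, 0), max(y - m, 0)
        crop = cv2.resize(frame[y0:y + h + m, x0:x + w + m], (96, 96))
        if writer is None:
            fps = cap.get(cv2.CAP_PROP_FPS) or 25
            writer = cv2.VideoWriter(out_path,
                                     cv2.VideoWriter_fourcc(*"mp4v"),
                                     fps, (96, 96))
        writer.write(crop)
    cap.release()
    if writer is not None:
        writer.release()
```

Note that cv2.VideoWriter drops the audio track, so you would need to mux the original audio back in afterwards, and per-frame detections can jitter; smoothing the boxes across frames (as preprocess.py's batched detection effectively does) gives steadier crops.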

I8Robot commented 1 year ago

> Thank you dear, I'll have a try! Thanks for your attention again 🥰

Did you have a background problem when the training-set backgrounds aren't varied? My training-set backgrounds are blue or white, the training result is poor, and the model can't deal with occlusion, such as a finger covering the face.
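
On the occlusion failure specifically, one generic mitigation (not something this repo ships) is cutout-style augmentation: paste random patches over the training crops so the model occasionally sees partially covered faces. A rough sketch, with the patch-size bounds as arbitrary assumptions:

```python
import numpy as np

# Cutout-style occlusion augmentation (a generic technique, not part of this
# repo): paste a random flat-colored rectangle over a face crop so training
# sometimes sees partially covered faces.
def random_occlusion(img, p=0.5, max_frac=0.3, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() > p:
        return img
    h, w = img.shape[:2]
    ph = int(rng.integers(h // 10, int(h * max_frac) + 1))  # patch height
    pw = int(rng.integers(w // 10, int(w * max_frac) + 1))  # patch width
    y = int(rng.integers(0, h - ph))
    x = int(rng.integers(0, w - pw))
    out = img.copy()
    out[y:y + ph, x:x + pw] = rng.integers(0, 256, size=3)  # random color
    return out
```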

Gangadhar24377 commented 1 year ago

> https://github.com/deeplsd/Merkel-Podcast-Corpus
>
> Hi, may I ask you some questions about the LRS2 dataset? I want to train on datasets other than LRS2; how could I do that?
>
> 1. What do NF and MV mean in test.txt? (`6330311066473698535/00011 NF`, `6330311066473698535/00018 MV`)
> 2. Does "Conf" mean confidence in 00001.txt? (`Text: WHEN YOU'RE COOKING CHIPS AT HOME`, `Conf: 4`)

I want to know the same thing. Did you find out what these are?

Gangadhar24377 commented 1 year ago

I'm a beginner trying to learn and work with the Wav2Lip model, and I want to train it on my custom dataset, which is similar in structure to the LRS2 dataset. Can you guide me through the procedure? I have looked at the README file and am a bit confused.

Help would be very much appreciated!
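
For the procedure itself, the README section linked at the top of this thread boils down to three steps once your dataset is in the LRS2 folder layout (paths below are placeholders; videos should already be at 25 fps):

```sh
# 1) Detect and crop faces, and extract audio, for every video
python preprocess.py --data_root data_root/main --preprocessed_root lrs2_preprocessed/

# 2) Train the expert lip-sync discriminator first (the README suggests
#    waiting until its eval loss is around 0.25 before moving on)
python color_syncnet_train.py --data_root lrs2_preprocessed/ --checkpoint_dir checkpoints/syncnet/

# 3) Then train Wav2Lip against the frozen expert
python wav2lip_train.py --data_root lrs2_preprocessed/ --checkpoint_dir checkpoints/wav2lip/ \
    --syncnet_checkpoint_path checkpoints/syncnet/<your_checkpoint>.pth
```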