Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For the HD commercial model, please try out Sync Labs: https://synclabs.so

Sync loss cannot be reduced #27

Closed · wuxiaolianggit closed this issue 3 years ago

wuxiaolianggit commented 3 years ago

Excuse me, I would like to ask how to solve the problem that the sync loss cannot be reduced. @prajwalkr

wuxiaolianggit commented 3 years ago

```
L1: 0.055714061856269835, Sync Loss: 6.791266250610351: : 5it [00:12, 2.51s/it]
Starting Epoch: 5759
global_step is: 28800
0it [00:00, ?it/s]===========================saving images
L1: 0.05536094978451729, Sync Loss: 5.467270183563232: : 5it [00:12, 2.46s/it]
Starting Epoch: 5760
global_step is: 28805
L1: 0.05613975748419762, Sync Loss: 6.362647533416748: : 5it [00:12, 2.43s/it]
Starting Epoch: 5761
global_step is: 28810
L1: 0.054631995409727095, Sync Loss: 4.843200063705444: : 5it [00:13, 2.72s/it]
Starting Epoch: 5762
global_step is: 28815
L1: 0.05542518720030785, Sync Loss: 6.10022611618042: : 5it [00:11, 2.40s/it]
Starting Epoch: 5763
global_step is: 28820
L1: 0.05484302267432213, Sync Loss: 5.2569538116455075: : 5it [00:12, 2.42s/it]
Starting Epoch: 5764
global_step is: 28825
```

prajwalkr commented 3 years ago

Your epoch seems to consist of 5 iterations. What dataset are you training on?
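
(From the log: each epoch runs `5it` and `global_step` advances by 5 per epoch, so the loader is yielding only 5 batches. With a batch size of B, iterations per epoch ≈ ceil(N / B), so 5 iterations implies at most 5 × B training windows in total.)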

wuxiaolianggit commented 3 years ago

The dataset is self-made; it is not a public one. @prajwalkr

prajwalkr commented 3 years ago

  1. The dataset seems to be extremely small.
  2. If sync loss does not reduce to < 1 using just L1 loss, there is an issue with your dataset itself.
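
(Context on point 2: the generator's sync-loss weight, `syncnet_wt`, starts at zero by default, so early training is effectively L1-only while the sync loss is just being monitored. If even that monitored loss stays high, the likely culprit is audio-video misalignment in the data itself.)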

wuxiaolianggit commented 3 years ago

I recorded 13 minutes of video with a single person; the FPS is 60.

wuxiaolianggit commented 3 years ago

Do you think a dataset produced from this video can be used?

prajwalkr commented 3 years ago

> I recorded 13 minutes of video with a single person; the FPS is 60.

I do not know whether 13 minutes of data will yield any useful results. More importantly, I strongly suggest you resample your videos to 25 FPS; otherwise, you will have to change the code in several places.
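
If it helps, here is a minimal resampling sketch (assuming `ffmpeg` is on your PATH; the folder names `raw_videos` and `videos_25fps` are placeholders):

```python
import subprocess
from pathlib import Path

SRC_DIR = Path("raw_videos")    # placeholder: your original 60 FPS clips
DST_DIR = Path("videos_25fps")  # placeholder: resampled output folder
DST_DIR.mkdir(exist_ok=True)

for src in SRC_DIR.glob("*.mp4"):
    dst = DST_DIR / src.name
    # -r 25 forces a 25 FPS output; -ar 16000 -ac 1 re-encodes the audio
    # as 16 kHz mono, matching the repo's default sample rate.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-r", "25",
         "-ar", "16000", "-ac", "1", str(dst)],
        check=True,
    )
```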

wuxiaolianggit commented 3 years ago

Does the FPS have a large influence on the training results?

prajwalkr commented 3 years ago

> Does the FPS have a large influence on the training results?

Yes; the audio window, among other things, is determined based on the FPS.
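
To make that concrete, here is a simplified sketch of the windowing arithmetic (it mirrors the spirit of the repo's `crop_audio_window`, assuming the default 16 kHz sample rate, a 200-sample mel hop, i.e. 80 mel frames per second, and 5-frame training windows):

```python
# Mel frames per second: sample_rate / hop_size = 16000 / 200 = 80
MEL_FPS = 80.0
VIDEO_FPS = 25.0   # the frame rate the pipeline assumes
SYNCNET_T = 5      # video frames per training window

def crop_audio_window(mel, start_frame_num):
    """Slice the mel-spectrogram chunk time-aligned with video frames
    [start_frame_num, start_frame_num + SYNCNET_T)."""
    start_idx = int(MEL_FPS * (start_frame_num / VIDEO_FPS))
    # 5 frames at 25 FPS span 0.2 s of audio = 16 mel steps at 80 mel fps
    end_idx = start_idx + int(MEL_FPS * SYNCNET_T / VIDEO_FPS)
    return mel[start_idx:end_idx, :]
```

At 60 FPS these constants stop lining up (80 × 5 / 60 ≈ 6.7 mel steps per window, not an integer), which is why resampling to 25 FPS is much easier than patching the code everywhere such a window is computed.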

Also, you need to train the expert discriminator on your own dataset before training the lip-sync generator.
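
(For reference, the order in this repo's README: train the expert discriminator with `color_syncnet_train.py --data_root <preprocessed_root> --checkpoint_dir <ckpt_dir>`, then train the generator with `wav2lip_train.py`, passing `--syncnet_checkpoint_path` so it can load the expert checkpoint.)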

wuxiaolianggit commented 3 years ago

How big a dataset is needed to train for good results? Because I can't get the LRS2 data, I don't know what size of dataset is appropriate.

prajwalkr commented 3 years ago

If you want to train for any speaker in the wild, the LRS2 dataset that the released models were trained on is about 29 hours. We are not sure about the required dataset size if you only want to handle a single speaker.

wuxiaolianggit commented 3 years ago

Thank you very much for your reply.