Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs:
https://synclabs.so

sync loss of expert discriminator #55

Closed Augnine closed 4 years ago

Augnine commented 4 years ago

I am training the expert discriminator on my own dataset, but the loss stays above 0.69. I am unsure whether such a model can be used for 'wav2lip_train'.

prajwalkr commented 4 years ago

You can use the expert discriminator to good effect only if its eval loss is ~0.25.
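
(For context: 0.69 is essentially ln 2 ≈ 0.693, the chance-level binary cross-entropy, so a discriminator stuck there is scoring every pair near 0.5 and has not yet learned to separate in-sync from off-sync examples. Below is a minimal sketch of a SyncNet-style cosine-similarity BCE loss to illustrate this; the names and the clamping detail are illustrative, not necessarily the repository's exact code.)

```python
import math
import torch.nn as nn
import torch.nn.functional as F

logloss = nn.BCELoss()

def cosine_bce_loss(audio_emb, video_emb, labels):
    # audio_emb, video_emb: (B, D) embeddings; labels: (B, 1) with 1 = in-sync, 0 = off-sync
    d = F.cosine_similarity(audio_emb, video_emb)   # similarity in [-1, 1]
    d = d.unsqueeze(1).clamp(min=0.0)               # BCELoss expects values in [0, 1]
    return logloss(d, labels)

# A discriminator that has learned nothing scores every pair near 0.5,
# which yields exactly the loss reported above:
print(math.log(2))  # 0.6931... -> "stuck at 0.69" means no better than chance
```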

Augnine commented 4 years ago

You can use the expert discriminator to good effect only if its eval loss is ~0.25.

Thanks for your reply. There is another question about hyperparameters.

You chose: batch size = 64, learning rate = 1e-4, Adam optimizer with default parameters.

Is there any other important hyperparameter that can have a large effect?
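
(For reference, the setup quoted above amounts to something like the following PyTorch sketch; the stand-in model and names are illustrative, not the repository's actual training script.)

```python
import torch.nn as nn
import torch.optim as optim

BATCH_SIZE = 64        # batch size = 64
LEARNING_RATE = 1e-4   # learning rate = 1e-4

model = nn.Linear(512, 1)  # stand-in for the expert discriminator (SyncNet)
# "Adam with default parameters" means betas=(0.9, 0.999), eps=1e-8, weight_decay=0
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
```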

prajwalkr commented 4 years ago

Is there any other important hyperparameter that can have a large effect?

Not to our knowledge.

Augnine commented 4 years ago

Is there any other important hyperparameter that can have a large effect?

Not to our knowledge.

I will try it again. Thanks!

siddu9501 commented 4 years ago

Here is one often-overlooked part of the dataloader code when using external datasets: https://github.com/Rudrabha/Wav2Lip/blob/master/color_syncnet_train.py#L77

The img_name and wrong_img_name are chosen randomly from the entire video. The SyncNet paper says that positive and negative examples should lie within a window of 2 seconds. The network might not learn anything when given a negative that is completely out of sync.

So, you might want to change that window to be a random choice within 100 frames in either direction.
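
(A minimal sketch of that change, assuming frames are indexed by integer frame ids as in the repo's dataloader; the helper name and max_offset parameter are illustrative.)

```python
import random

def sample_pair(frame_ids, max_offset=100):
    """Pick a positive frame and a 'wrong' (negative) frame at most
    max_offset frames away, instead of anywhere in the video."""
    pos = random.choice(frame_ids)
    # restrict the negative to a +/- max_offset window around the positive frame
    window = [f for f in frame_ids if f != pos and abs(f - pos) <= max_offset]
    if not window:  # very short clip: fall back to any other frame
        window = [f for f in frame_ids if f != pos]
    wrong = random.choice(window)
    return pos, wrong
```

(If your videos are 25 fps, +/-100 frames is +/-4 seconds; a smaller max_offset would stay closer to the 2-second window mentioned above.)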

prajwalkr commented 4 years ago

The SyncNet paper says that positive and negative examples should lie within a window of 2 seconds.

Do you know the reason why this would work better? We do not, since even a randomly sampled segment is out of sync. Please let us know if you have some idea.

siddu9501 commented 4 years ago

Intuitively, if the window is smaller, we choose negative pairs in close proximity, or even with partial overlap, more often. These are harder examples to learn from, as opposed to random sampling from the entire video, where such difficult examples appear less often. It also keeps the rest of the lower face similar (in most cases), which forces the network to rely on the lip region to distinguish positive from negative examples.

prajwalkr commented 4 years ago

We will definitely try this and will add it as a suggestion if it works better. Thanks!