Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs
https://synclabs.so

About wav2lip_GAN training #256

Open 15458wew opened 3 years ago

15458wew commented 3 years ago

Hi, thank you for sharing such a good project. I am trying to reproduce your results on LRW. I did not train the lip-sync discriminator myself; I used the pre-trained model you provided for my experiments. I ran into some problems:

  1. The value of perceptual_loss (min: 0.1, max: 27) seems unrelated to the actual generation quality. Across many experiments I got good face results at both a low perceptual_loss (0.6) and a high one (27).
  2. During training, the sync loss never drops below the threshold (0.75). When I manually set the sync wt to 0.1, the sync term is enabled and the loss converges to a lower value (train sync loss: 0.22, eval sync loss: 0.19). The generated video has a good mouth shape, but the lips become very blurry.
  3. The L1 loss converges to 0.01-0.02 in every case.

I hope to discuss these problems with you and solve them.
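For readers following point 2, the gating it describes can be sketched roughly as below. This is a minimal sketch, not the repository's training script: the function names `combined_loss` and `maybe_enable_sync` are hypothetical, and the weighted-sum form, the `disc_wt` argument, and the `enabled_wt=0.03` default are assumptions made for illustration. Only the 0.75 threshold and the manually chosen weight of 0.1 come from the post.

```python
def combined_loss(l1, sync, perceptual, sync_wt, disc_wt):
    """Weighted sum of the three generator losses (assumed form).

    The L1 term receives whatever weight the other two terms leave over.
    """
    if not 0.0 <= sync_wt + disc_wt <= 1.0:
        raise ValueError("sync_wt + disc_wt must lie in [0, 1]")
    return sync_wt * sync + disc_wt * perceptual + (1.0 - sync_wt - disc_wt) * l1


def maybe_enable_sync(eval_sync_loss, sync_wt, threshold=0.75, enabled_wt=0.03):
    """Switch the sync term on once the evaluation sync loss is below the threshold.

    A weight that is already non-zero (for example one set by hand) is left alone.
    """
    if sync_wt == 0.0 and eval_sync_loss < threshold:
        return enabled_wt
    return sync_wt


# Sync loss still above the threshold: the weight stays at 0.
still_off = maybe_enable_sync(eval_sync_loss=0.9, sync_wt=0.0)

# Weight set by hand to 0.1, as in the post: the sync term now enters the total.
total = combined_loss(l1=0.02, sync=0.22, perceptual=0.6, sync_wt=0.1, disc_wt=0.07)
```

In this sketch, a weight that starts at 0 contributes nothing to the total until the threshold is crossed, so the sync loss is never optimized directly in the meantime. That is consistent with having to set the weight by hand, though it does not by itself explain the blurry lips.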

In addition, I trained a 192×192 sync discriminator myself and modified the network structure to generate 192×192 video. During training, however, perceptual_loss and the real loss drop to 0 while the fake loss rises to 27. I checked the samples produced during training, and the generated images are acceptable.
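One possible reading of the recurring value 27 is that it is the ceiling of a saturated binary cross-entropy. The sketch below assumes the loss stabilizes its logarithms with an epsilon of 1e-12, which is an assumption about the PyTorch version in use (implementations that clamp the log instead give a different ceiling). The helper `bce` is hypothetical and written in plain Python for illustration.

```python
import math


def bce(pred, target, eps=1e-12):
    """Binary cross-entropy for a single prediction, with eps added inside the logs."""
    return -(target * math.log(pred + eps)
             + (1.0 - target) * math.log(1.0 - pred + eps))


# A discriminator that has collapsed to outputting exactly 1.0 for every input:
real_loss = bce(1.0, 1.0)        # real frames, target 1
perceptual_loss = bce(1.0, 1.0)  # generated frames scored against target 1
fake_loss = bce(1.0, 0.0)        # generated frames, target 0

print(real_loss, perceptual_loss, fake_loss)
```

Under these assumptions the real and perceptual terms are essentially 0 and the fake term equals -ln(1e-12), about 27.63. That pattern would mean the discriminator is no longer separating real from generated frames, so its loss values say little about image quality.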

So I am currently at a loss and don't know how to solve these problems. Looking forward to your reply.

chuqidecha commented 3 years ago

I encountered these problems, too.

When perceptual_loss and the real (or fake) loss become 0 and the fake (or real) loss rises to 27, Wav2Lip_disc_qual is predicting 1 (or 0) for arbitrary x, because its face_encoder_blocks encode every x to the same features. Adding BN (batch normalization) after each convolutional layer may help.
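The BN suggestion could look roughly like the block below. This is only a sketch under stated assumptions: the class name `NormConv2d` is hypothetical, the layer sizes are arbitrary, and the LeakyReLU slope of 0.01 is an assumption rather than a value confirmed by this thread. It shows where a `BatchNorm2d` layer would sit, not how the repository defines its discriminator blocks.

```python
import torch
from torch import nn


class NormConv2d(nn.Module):
    """Conv2d -> BatchNorm2d -> LeakyReLU (hypothetical block for illustration)."""

    def __init__(self, cin, cout, kernel_size, stride, padding):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size, stride, padding),
            nn.BatchNorm2d(cout),  # the normalization step suggested above
            nn.LeakyReLU(0.01, inplace=True),
        )

    def forward(self, x):
        return self.block(x)


# A batch of four 96x96 RGB crops passed through one block.
x = torch.randn(4, 3, 96, 96)
y = NormConv2d(3, 32, kernel_size=7, stride=1, padding=3)(x)
print(y.shape)
```

Batch normalization rescales each channel per batch, which can make it harder for the encoder to map every input to the same features. Whether it actually fixes the collapse reported here is untested in this thread.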

Have you solved the other problems by now? I need help, too.

zhanghm1995 commented 3 years ago

Same problems here. Have you solved these issues? @15458wew

OldSixOne commented 2 years ago

@15458wew Did you end up using 96×96?

OldSixOne commented 2 years ago

@chuqidecha Have you solved it?

OldSixOne commented 2 years ago

@zhanghm1995 Have you solved it?