Weizhi-Zhong / IP_LAP

CVPR2023 talking face implementation for Identity-Preserving Talking Face Generation With Landmark and Appearance Priors
Apache License 2.0

train video render, getting artifacts in the mouth #12

Closed ghost closed 1 year ago

ghost commented 1 year ago

Thanks for your great work and dedication! I reproduced the code and trained on LRS2. The running_warp_loss is 8.72333 and running_gen_loss is 5.7946, but I got the artifacts shown here: (image attached)

Weizhi-Zhong commented 1 year ago

Hi, thanks for your interest. What is the printed FID, and what is your batch size?

Weizhi-Zhong commented 1 year ago

@primepake Please refer to this issue: https://github.com/Weizhi-Zhong/IP_LAP/issues/8#issuecomment-1569967050. Hope this can help you.

ghost commented 1 year ago

The batch size is the default 96, ref_N=5, and the TensorBoard logs are here. I stopped at epoch 84. At which epoch did you stop training? (image attached)

ghost commented 1 year ago

The full result is here: https://github.com/Weizhi-Zhong/IP_LAP/assets/39094983/a8431603-657b-403c-975a-cc3687155c11 and my TensorBoard is here: https://tensorboard.dev/experiment/35MPt3dkTYu7WbAxT6OZhA/

Weizhi-Zhong commented 1 year ago

Hi, what ref_img_N do you use in inference_single.py?

ghost commented 1 year ago

I used the default config in inference (image attached).

Weizhi-Zhong commented 1 year ago

> The batch size is the default 96, ref_N=5, and the TensorBoard logs are here. I stopped at epoch 84. At which epoch did you stop training? (image attached)

We stopped training near 300 epochs, where FID is around 19, eval_gen_loss is around 7, and eval_warp_loss is around 11. But this training used ref_N=3.

We have not tried training with ref_N=5. The first reason is that it can be costly. More importantly, a few videos of LRS2 are short (e.g., 30 frames). If the renderer is trained with too many reference images, some input reference images may be very similar to the ground truth, leading the network to learn a shortcut from the input reference images.

Have you tried training with ref_N=3? That may work. Hope this can help you.
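To make the "shortcut" concern concrete, one way to sample references is to keep a minimum frame gap between any reference and the ground-truth frame. This is only an illustrative sketch, not the repo's actual sampling code; `sample_ref_frames`, `min_gap`, and the default values are assumptions:

```python
import random

def sample_ref_frames(num_frames, gt_idx, ref_N=3, min_gap=5):
    """Pick ref_N reference frame indices at least `min_gap` frames away from
    the ground-truth frame, so short clips do not feed the renderer references
    that are nearly identical to the target."""
    candidates = [i for i in range(num_frames) if abs(i - gt_idx) >= min_gap]
    if len(candidates) < ref_N:
        # clip is too short to keep the gap; fall back to any non-target frame
        candidates = [i for i in range(num_frames) if i != gt_idx]
    return random.sample(candidates, min(ref_N, len(candidates)))

# e.g. a 30-frame LRS2 clip with the target at frame 12
print(sample_ref_frames(30, gt_idx=12, ref_N=3))
```

With a 30-frame clip and ref_N=5, almost every frame ends up either a reference or close to one, which is exactly why a smaller ref_N is safer on LRS2.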

ghost commented 1 year ago

Oh, that's a good idea! Thank you so much.

ghost commented 1 year ago

By the way, did you train with both LRS2 and LRS3 for the checkpoint you released, or just with LRS2?

Weizhi-Zhong commented 1 year ago

Hi, thanks for your interest. The released model is trained only with LRS2. As described in our paper, we sample 45 videos from LRS3 to test the generalization ability of the method, not to train on them.

ghost commented 1 year ago

Hey! Thanks for your dedication. I found an issue in the preprocessing part: resizing the sketch image to 384 and then resizing it down to 128 loses information in the sketch images.
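A toy illustration of why this hurts thin landmark sketches (not the repo's preprocessing code; `draw_sketch` and the coordinates are hypothetical): downscaling a 1-px line drawn at 384 averages it into faint gray, while drawing directly at the target size keeps it crisp.

```python
import cv2
import numpy as np

# toy lip contour in normalized [0, 1] coordinates (hypothetical values)
pts = np.array([[0.3, 0.7], [0.5, 0.55], [0.7, 0.7]])

def draw_sketch(points, size):
    """Rasterize landmark points as 1-px polylines on a size x size canvas."""
    canvas = np.zeros((size, size), dtype=np.uint8)
    scaled = (points * (size - 1)).astype(np.int32)
    cv2.polylines(canvas, [scaled], isClosed=False, color=255, thickness=1)
    return canvas

# pipeline A: draw at 384, then downscale to 128 (thin lines get averaged away)
downscaled = cv2.resize(draw_sketch(pts, 384), (128, 128), interpolation=cv2.INTER_AREA)
# pipeline B: draw directly at 128, keeping full-intensity 1-px lines
direct = draw_sketch(pts, 128)

print("max intensity after downscale:", downscaled.max(), "vs direct:", direct.max())
```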

ghost commented 1 year ago

(image attached) I upgraded the resolution to 384 and got the noise at step 18k. I train with a small batch size of 24; does that matter?

Weizhi-Zhong commented 1 year ago

> (image attached) I upgraded the resolution to 384 and got the noise at step 18k. I train with a small batch size of 24; does that matter?

Hi, can you give more details about how you upgraded the resolution? Besides the small batch size, other factors may affect the results.

ghost commented 1 year ago

I basically ran your code, only changing the input size of the dense-flow network and adding some layers. The detailed implementation is here: https://github.com/primepake/IP_LAP/commit/7dcd80c2ee4494abbfbea2817d67e956a6def06b

Weizhi-Zhong commented 1 year ago

Hi, thanks for your interest. If the resolution is increased, the PerceptualLoss and the GAN loss may need to be modified accordingly, as well as the model architecture.
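One common way to adapt image losses when the output resolution grows is to evaluate them at several scales, so a 384x384 output is also supervised at the original working resolution. A minimal sketch (not the repo's actual loss code; `multiscale_loss`, `loss_fn`, and the scale choices are assumptions):

```python
import torch.nn.functional as F

def multiscale_loss(fake, real, loss_fn, scales=(1.0, 0.5, 0.25)):
    """Apply an image loss (e.g. a perceptual or discriminator feature loss)
    at several scales of the generated and ground-truth images."""
    total = 0.0
    for s in scales:
        if s == 1.0:
            f, r = fake, real
        else:
            f = F.interpolate(fake, scale_factor=s, mode='bilinear', align_corners=False)
            r = F.interpolate(real, scale_factor=s, mode='bilinear', align_corners=False)
        total = total + loss_fn(f, r)
    return total / len(scales)
```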

ghost commented 1 year ago

Oh great, that's a good idea! Thanks for your dedication.

ghost commented 1 year ago

By the way, do you have plans to release the pretrained LipLMD? That should be necessary for the evaluation stage.

Weizhi-Zhong commented 1 year ago

Hi, the evaluation code borrows from this repo.

Because we have so many other things to do every day, we may not have enough time to contribute to this repository, for which we are deeply sorry.
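For reference, a LipLMD-style metric is typically the mean L2 distance between corresponding lip landmarks of the generated and ground-truth frames. A minimal sketch (not this repo's evaluation code; it assumes 68-point landmarks are already extracted and aligned, with the mouth at indices 48-67):

```python
import numpy as np

def lip_lmd(pred_lms, gt_lms, lip_idx=range(48, 68)):
    """Mean L2 distance over lip landmarks, averaged over frames.

    pred_lms, gt_lms: arrays of shape (num_frames, 68, 2) holding landmarks
    of the generated and the ground-truth video."""
    pred = np.asarray(pred_lms)[:, list(lip_idx), :]
    gt = np.asarray(gt_lms)[:, list(lip_idx), :]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```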

ghost commented 11 months ago

I reproduced the code of the landmark generator and tested with your pretrained landmark model, but got the same issue: the mouth can't close during silent speech. It also depends on the identity of the face: when I run the same audio with different identity faces, some can close and some can't. I also tried with different Nl contents. What do you think about this issue?

https://github.com/Weizhi-Zhong/IP_LAP/assets/39094983/bd4c70c2-ae5a-4242-88ee-11d2286fa106

https://github.com/Weizhi-Zhong/IP_LAP/assets/39094983/cf07c109-8371-4274-9901-f2c47067ecd0

tailangjun commented 6 months ago

> I basically ran your code, only changing the input size of the dense-flow network and adding some layers. The detailed implementation is here: https://github.com/primepake/IP_LAP/commit/7dcd80c2ee4494abbfbea2817d67e956a6def06b

Is the result clearer after you increased the number of network layers (I know your Wav2Lip-288x288 work is based on this idea), and have all the problems you encountered been solved?

ghost commented 6 months ago

It looks like the author has another network that can apply a sync loss like SyncNet, but they haven't pushed it yet.
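For context, a SyncNet-style sync loss (as popularized by Wav2Lip) scores audio/lip agreement with the cosine similarity of an audio embedding and a lip-region video embedding, and pushes it toward 1. A minimal sketch under that assumption (the encoders producing `audio_emb` and `video_emb` are not shown and are not from this repo):

```python
import torch
import torch.nn.functional as F

def sync_loss(audio_emb, video_emb, eps=1e-7):
    """SyncNet-style loss: cosine similarity between audio and lip-region
    embeddings, trained toward "in sync" (label 1) with a BCE objective."""
    sim = F.cosine_similarity(audio_emb, video_emb, dim=1)
    # map similarity from [-1, 1] to (0, 1) before the BCE term
    prob = ((sim + 1.0) / 2.0).clamp(eps, 1 - eps)
    target = torch.ones_like(prob)
    return F.binary_cross_entropy(prob, target)
```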