Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs:
https://synclabs.so

Why does this model (wav2lip_gan_base.pth) generate a video with a mask over it that makes it unclear? #222

Closed iamchenxin-coder closed 3 years ago

iamchenxin-coder commented 3 years ago

Hello, I used this model with the video generated from picture 1 as input. It looks like there is a layer of mask over it, so the result is not clear. What is the problem, and how can I remove it? Sorry, the video can't be uploaded.

prajwalkr commented 3 years ago

Sorry, I do not understand your question. What mask are you talking about? And what is picture 1?

iamchenxin-coder commented 3 years ago

Hello @prajwalkr, because I can't upload pictures that are too large, I will shrink a frame of the generated video and send it to you for a look. This frame was generated by Wav2Lip + GAN, but there is a layer of mask over it, which makes the whole video unclear. The larger the input image, the higher the resolution of the generated video, and the more obvious the mask becomes. I especially want to know how to remove the mask. [screenshot: video_with_mask]

Following the project's documentation, I ran inference.py. [screenshots of the code omitted]

Finally, I found the code in the datagen function that controls the masking of the face in the generated video. I don't know how to adjust the img_masked value to make the mask invisible; please tell me how to adjust it here. The code in the red box in the picture below was added by myself, so I could see the effect of modifying this value. [screenshot omitted]

The generated video frame is shown in the figure below: [screenshot: generate_video_pic_img_mask=0]
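The experiment described above (filling the masked region of the face crop with a constant value and observing the output) can be sketched roughly as follows. This is my own minimal illustration, not the repository's code: `apply_mask` is a hypothetical name, and only `fill=0` matches what inference.py actually does.

```python
import numpy as np

IMG_SIZE = 96  # Wav2Lip operates on 96x96 face crops


def apply_mask(face, fill=0):
    """Fill the lower half of a face crop with a constant value.

    Mirrors the user's experiment of changing what the masked region
    of img_masked is set to; fill=0 is the value used in inference.py.
    """
    masked = face.copy()
    masked[IMG_SIZE // 2:] = fill  # rows 48..95 cover the mouth region
    return masked


face = np.full((IMG_SIZE, IMG_SIZE, 3), 128, dtype=np.uint8)
for fill in (0, 128, 255):
    out = apply_mask(face, fill)
    print(fill, int(out[-1, 0, 0]))  # the lower half now holds the fill value
```

Whatever constant is written here is discarded by the generator anyway: the model is trained to repaint the lower half from the audio, so changing the fill value changes what the network sees, not the visible "mask" in the output.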

Rudrabha commented 3 years ago

We concatenate a reference frame (from any random time step) with the GT frame whose lower half is masked (during training). This provides the model with the target head-pose information without leaking the lip shape (the GT frame contains the GT lip shape). During inference, at each time step we concatenate the lower-half-masked frame with itself. That is the masking step you found in this line. I think face detection failed for this particular example: the face was probably detected above the real face, so the lower-half mask covered the forehead region instead of the mouth.
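The input construction described above can be sketched as a minimal NumPy illustration of the 6-channel visual input, assuming 96x96 crops and channel-last batches; `build_model_input` is my own hypothetical name, not a function in the repository.

```python
import numpy as np

IMG_SIZE = 96  # Wav2Lip face crop size


def build_model_input(pose_frames, reference_frames):
    """Concatenate lower-half-masked pose frames with unmasked reference
    frames along the channel axis, producing the 6-channel input the
    maintainer describes. At inference time the same frame plays both
    roles, so the model only ever sees its own lower half masked out."""
    masked = pose_frames.copy()
    masked[:, IMG_SIZE // 2:] = 0  # hide the lip region of the pose frame
    stacked = np.concatenate((masked, reference_frames), axis=3)
    return stacked.astype(np.float32) / 255.0  # scale to [0, 1]


frames = np.random.randint(0, 256, (4, IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)
inp = build_model_input(frames, frames)  # inference: frame paired with itself
print(inp.shape)  # (4, 96, 96, 6)
```

If the detected face box sits above the real face, the fixed "lower half" of this crop lands on the forehead rather than the mouth, which matches the visible band the user is reporting.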