Rudrabha / LipGAN

This repository contains the code for LipGAN. LipGAN was published as part of the paper titled "Towards Automatic Face-to-Face Translation".
http://cvit.iiit.ac.in/research/projects/cvit-projects/facetoface-translation
MIT License

Some problems about lip sync #21

Closed tju-zxy closed 4 years ago

tju-zxy commented 4 years ago

Hi, @prajwalkr. Thanks for sharing this revolutionary work. However, when I run the code with the same image that you gave in a previous issue, I cannot get a satisfactory result. My result video is here: https://www.youtube.com/watch?v=beuf71Wrg3g. Could you give me some advice on improving the result or correcting my possible mistakes? Thanks a lot!

prajwalkr commented 4 years ago

Hello, for some reason the face is not detected properly. This is not a failure of LipGAN but rather of the face detection. You can adjust the detected box by adding padding via this parameter: https://github.com/Rudrabha/LipGAN/blob/17ee347ebacb9c79f36ef978c22bc494ad1a9546/batch_inference.py#L25

You can find an example mentioned in another similar issue: https://github.com/Rudrabha/LipGAN/issues/14#issuecomment-595087268

Please experiment with this padding a little bit to ensure the detected face box covers most of the face.
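For readers landing here, a minimal sketch of what such padding amounts to, assuming a (top, bottom, left, right) ordering and illustrative variable names; the actual argument name and defaults in batch_inference.py may differ:

```python
import cv2

# Illustrative padding values: extra pixels added on each side of the detected box.
# The (top, bottom, left, right) ordering is an assumption for this sketch.
pads = (0, 20, 0, 0)

def pad_box(box, pads, frame_shape):
    """Expand a detected (x1, y1, x2, y2) face box by the padding, clipped to the frame."""
    x1, y1, x2, y2 = box
    top, bottom, left, right = pads
    h, w = frame_shape[:2]
    return (max(0, x1 - left), max(0, y1 - top),
            min(w, x2 + right), min(h, y2 + bottom))

frame = cv2.imread('input_face.png')      # placeholder input image
detected = (120, 80, 260, 240)            # placeholder detector output
x1, y1, x2, y2 = pad_box(detected, pads, frame.shape)
face_crop = frame[y1:y2, x1:x2]           # crop that should cover most of the face
```

Increasing the bottom padding is usually the first thing to try when the detected box cuts off the chin.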

tju-zxy commented 4 years ago

Thanks for your help! It indeed works, and your advice is very valuable. Now the image generates a decent result. However, when I input a video and the audio, the lip movements of the generated video are almost the same as those in the source video. I have tried extracting some frames from the video to test the model, and the result generated from a single frame is acceptable, so I sincerely hope you can give me some advice on improving the result generated from a video. My results are listed as follows:

The source video: https://youtu.be/vM2HlaztgCM
The result generated from the video: https://youtu.be/3b9p1h7Df4c
The result generated from a frame: https://youtu.be/8YFsRaRJaPo

Thanks a lot!
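(For anyone reproducing this comparison: a single frame can be pulled out of the source video with OpenCV along these lines; the file names and frame index below are placeholders.)

```python
import cv2

cap = cv2.VideoCapture('source_video.mp4')   # placeholder path to the source video
cap.set(cv2.CAP_PROP_POS_FRAMES, 100)        # jump to an arbitrary frame index
ok, frame = cap.read()
if ok:
    cv2.imwrite('static_input.png', frame)   # use this image as the static-frame input
cap.release()
```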

prajwalkr commented 4 years ago

Hello

Glad the result improved. For the result generated from a single frame, I think you can improve it further if you adjust the padding so that the crop extends just to the chin at the bottom and covers the sides of the face.

Results from a static frame will always be superior to results on moving frames. Because ours is a frame-based model, you will observe temporal inconsistencies, and thus poor results in some cases, especially during silences. We are working on a follow-up that resolves these issues and will update this repo accordingly.

ak9250 commented 4 years ago

@tju-zxy have you tried https://github.com/yiranran/Audio-driven-TalkingFace-HeadPose for your video input?

shikhar-scs commented 4 years ago

Hey @Rudrabha, for a different problem: if I want to mask the complete face with the ground truth in the face encoder, do I need to make any changes in the preprocessing part, or only in the batch_inference part by adjusting the paddings?

prajwalkr commented 4 years ago

"want to mask the complete face with the ground truth in the face encoder"

I am sorry, I do not understand; please explain further. But I can assure you that nothing needs to be changed in the preprocessing part.
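To frame the question: in models of this kind, the face-encoder input at inference is typically the face crop with its lower half zeroed out, concatenated channel-wise with an unmasked reference crop, so masking the complete face would be a change at that inference-time step rather than in preprocessing. A rough sketch of the idea; the crop size, channel ordering, and half-masking convention are assumptions for illustration, not copied from batch_inference.py:

```python
import numpy as np

IMG_SIZE = 96  # assumed square face-crop size

def mask_lower_half(face):
    """Typical inference-time input: zero out the lower half of the face crop."""
    masked = face.copy()
    masked[IMG_SIZE // 2:] = 0
    return masked

def mask_full_face(face):
    """Variant discussed in this thread: zero out the entire face crop."""
    return np.zeros_like(face)

face = np.random.randint(0, 255, (IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)  # stand-in face crop
reference = face                                                           # stand-in reference crop
model_input = np.concatenate([mask_full_face(face), reference], axis=-1)   # assumed channel-wise concat
```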

shikhar-scs commented 4 years ago

Hey, I probably figured out that part, no worries. Meanwhile, if you could have a look at https://stackoverflow.com/questions/61608295/attributeerror-nonetype-object-has-no-attribute-inbound-nodes-add-conv-la, it would be a great help.

prajwalkr commented 3 years ago

"However, when I input a video and the audio, the lip movements of the generated video are almost the same as those in the source video"

Please switch to this latest improved work: https://github.com/Rudrabha/Wav2Lip :-)