Inferencer / LipSick

🤢 LipSick: Fast, High Quality, Low Resource Lipsync Tool 🤮

Need impeccable Lipsync, Have tried GFP and GPEN-GAN post processing on each frame #12

Closed — kalpesh-maru closed this 4 months ago

kalpesh-maru commented 4 months ago

I am trying to get the most realistic lip-syncing possible; I have tried different models and feel this is achievable with DINet. I have integrated GFPGAN and GPEN into the code so that it enhances the full frames after lip-syncing (will share that code soon), and I am sharing the output. The other output is from a third-party service, which is paid. Any more suggestions on how to get better lip-sync?

Also, would it be possible to connect via Telegram, email, or another platform, @Inferencer? I have been working on this for a while and have some ideas I would like to discuss with you.

Inferencer commented 4 months ago

Sounds good. We can also upscale 1-5 reference frames and they will be used for the entire video, which would save the need to upscale each frame in post.
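A minimal sketch of that idea, assuming frames are NumPy arrays and using a stand-in `enhance()` in place of a real upscaler such as GFPGAN or GPEN (the function names and shapes here are illustrative, not LipSick's actual code): the chosen reference frames are enhanced once up front and reused for every output frame, instead of enhancing each frame in post.

```python
import numpy as np

def enhance(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a real upscaler (GFPGAN/GPEN would go here)."""
    return np.clip(frame.astype(np.float32) * 1.1, 0, 255).astype(np.uint8)

def build_reference_bank(frames, ref_indices):
    """Enhance the chosen reference frames once; reuse them for the whole video."""
    return [enhance(frames[i]) for i in ref_indices]

# Usage: 100 dummy frames, one enhancement pass over 5 refs instead of 100 frames.
frames = [np.full((4, 4, 3), 128, dtype=np.uint8) for _ in range(100)]
refs = build_reference_bank(frames, [20, 20, 20, 20, 20])
```

The win is that the enhancer runs 5 times total rather than once per output frame, and it also avoids the frame-to-frame flicker that per-frame enhancement can introduce.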

My end goal, after all the features and upgrades are implemented, is to simply use these results as a driver for a one-shot project, or a decent Gaussian, until we get 3DMMs that can be personalized to an actor's speaking and expression style. We can discuss this more on Discord if you use that app; my username is Inferencer.

kalpesh-maru commented 4 months ago

I have tried a few things. I am enhancing the reference images with GPEN and GFPGAN as you suggested, and I am manually selecting good reference images in which the model's mouth is open and her teeth are showing. However, I feel we can get the output to match the paid service. I am sharing the results.

This is the output from the paid service. The lip-sync is realistic; most people won't be able to distinguish it from reality.

https://github.com/Inferencer/LipSick/assets/135098715/26913925-c81e-41e7-9eec-5161b8c82b90

This is our output. What I have changed: sending a zero NumPy array as the DeepSpeech tensor when a frame is silent, enhancing hand-picked reference images with GFPGAN and GPEN, and post-processing the full frames with GPEN for better blending around the neck.

https://github.com/Inferencer/LipSick/assets/135098715/062627fa-3537-4c63-afb7-d04b3478fa30
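The silent-frame change described above could be sketched roughly like this (the per-frame feature shape and the RMS-based silence check are assumptions for illustration, not the actual LipSick/DINet code):

```python
import numpy as np

DS_FEATURE_SHAPE = (29,)  # assumed per-frame DeepSpeech feature size

def is_silent(audio_chunk: np.ndarray, threshold: float = 1e-3) -> bool:
    """Crude silence check: RMS energy below a threshold."""
    return float(np.sqrt(np.mean(audio_chunk ** 2))) < threshold

def deepspeech_feature(audio_chunk: np.ndarray, extract) -> np.ndarray:
    """Return the extracted audio feature, or a zero tensor for silent frames,
    so the mouth stays closed instead of reacting to background noise."""
    if is_silent(audio_chunk):
        return np.zeros(DS_FEATURE_SHAPE, dtype=np.float32)
    return extract(audio_chunk)

# Usage with a dummy extractor standing in for the real DeepSpeech model:
extract = lambda chunk: np.ones(DS_FEATURE_SHAPE, dtype=np.float32)
silent_chunk = np.zeros(640, dtype=np.float32)
voiced_chunk = np.sin(np.linspace(0, 40 * np.pi, 640)).astype(np.float32)
silent_feat = deepspeech_feature(silent_chunk, extract)
voiced_feat = deepspeech_feature(voiced_chunk, extract)
```

Zeroing the feature rather than skipping the frame keeps the tensor shapes consistent for the generator while suppressing spurious mouth motion during silence.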

Inferencer commented 4 months ago

I wouldn't recommend using an upscaler in post-production due to temporal inconsistency. I use upscaling for dimly lit teeth in the reference frames, but in this example I wouldn't need to, as you would have studio lighting without shadows, etc.

When I do custom reference frames I normally just use one frame for all 5, as 5 reference frames are both the minimum and the maximum. There is also some padding in the code that may interfere with your selection: for instance, if I chose reference frame 20 and used it for all 5 reference frames, I would enter 20, 20, 20, 20, 20, but with padding that might end up being 15, 15, 15, 15, 15.
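The index shift described above can be illustrated with a small sketch (the padding offset of 5 and the clamping are assumptions chosen to reproduce the 20 → 15 example; check the actual code for the real behavior):

```python
def pad_reference_indices(indices, pad=5, num_frames=100):
    """Shift chosen reference indices back by an assumed padding offset,
    clamped to the valid frame range [0, num_frames - 1]."""
    return [min(max(i - pad, 0), num_frames - 1) for i in indices]

# Choosing frame 20 for all five slots ends up as frame 15 after padding.
print(pad_reference_indices([20, 20, 20, 20, 20]))  # → [15, 15, 15, 15, 15]
```

So when hand-picking a reference frame, it may be worth selecting an index slightly after the frame you actually want, to compensate for the offset.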

A blue background is also a good choice, as it makes the face box less obvious until I implement box removal. Funnily enough, though, the majority of the dataset had a green background, so it might be worth testing that too.