Elsaam2y / DINet_optimized

An optimized pipeline for DINet, reducing inference latency by up to 60% 🚀. Kudos to the authors of the original repo for this amazing work.

Bad results when the person has a beard, objects occlude the face, or there is movement. Here are some examples #11

davidmartinrius closed this issue 9 months ago

davidmartinrius commented 9 months ago

Hello!

I hope all of you are well. I am sharing three inference videos that show some problems I found with DINet.

As the title says, when the person has a beard, when objects pass in front of the face, or when the person is moving, the results are bad.

Video with person in movement: https://github.com/Elsaam2y/DINet_optimized/assets/16558194/6e8ab908-4a04-4090-8daa-7bb0dff83879

Hair in front of the mouth: https://github.com/Elsaam2y/DINet_optimized/assets/16558194/08d6c32c-e79c-41fd-8d6c-8b9d59704897

Man with beard: https://github.com/Elsaam2y/DINet_optimized/assets/16558194/fd66d03c-cb48-41fb-aa64-64e4bb70898d

That said, do you know how to solve this? Is it a limitation of DINet? Or does it happen because the models (both SyncNet and DINet) need more training? These three videos are unseen in training, so I don't actually know whether more training could solve these problems. The example videos don't have these features (beard or movement).

Note: if you click any of the links, it will open a new window with audio only. In that window you can download the file and then watch it.

Thank you!

David Martin Rius

Elsaam2y commented 9 months ago

Hi,

This is actually a general limitation of most lip-sync pipelines. The problem lies with face detection: whenever there are obstacles in front of the face, or even hair, it causes trouble. That applies to the first two videos. Regarding the beard video, I believe the problem could also be related to the skin color, which might not be similar to the dataset used for training DINet (HDTF). Hence, training or fine-tuning on similar data could help.
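For illustration, here is one crude way to flag the frames where the detector loses the face before they reach the lip-sync stage. This is only a sketch and not part of this repo's pipeline; mediapipe and the 0.6 threshold are my own choices here:

```python
import cv2
import mediapipe as mp

mp_face = mp.solutions.face_detection

def flag_unreliable_frames(video_path: str, min_conf: float = 0.6):
    """Return indices of frames where no face is detected above min_conf."""
    flagged = []
    cap = cv2.VideoCapture(video_path)
    with mp_face.FaceDetection(min_detection_confidence=min_conf) as detector:
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # mediapipe expects RGB; OpenCV reads BGR
            results = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            # no confident detection -> face likely occluded by hands/hair/objects
            if not results.detections:
                flagged.append(idx)
            idx += 1
    cap.release()
    return flagged
```

Flagged frames could then be passed through untouched instead of blended, which avoids the worst artifacts at the cost of the lip sync dropping out for those frames.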

Furthermore, the first two problems could also be fixed by fine-tuning the model on these specific videos, or by training an image-to-image translation model afterwards to fix these issues. However, that would be very specific to the character in the video and might be a bit challenging.
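As a rough sketch of what per-speaker fine-tuning could look like. All names here (`load_pretrained_dinet`, `SpeakerClipDataset`, the checkpoint path, the forward signature) are placeholders, not this repo's actual training API:

```python
import torch
from torch.utils.data import DataLoader

# Placeholders: load_pretrained_dinet() stands in for building DINet and
# loading the HDTF checkpoint; SpeakerClipDataset for a loader over the
# target speaker's clips. Neither is a real name from this repo.
model = load_pretrained_dinet("checkpoints/dinet_hdtf.pth")
model.train()

# Small learning rate: adapt to the new speaker without
# forgetting what was learned on HDTF.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
l1 = torch.nn.L1Loss()

loader = DataLoader(SpeakerClipDataset("my_videos/"), batch_size=4, shuffle=True)
for epoch in range(5):
    for ref_frames, audio_feat, target_frame in loader:
        pred = model(ref_frames, audio_feat)   # forward signature is illustrative
        loss = l1(pred, target_frame)          # the full recipe also uses perceptual/GAN losses
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```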

davidmartinrius commented 9 months ago

> Hi,
>
> This is actually a general limitation of most lip-sync pipelines. The problem lies with face detection: whenever there are obstacles in front of the face, or even hair, it causes trouble. That applies to the first two videos. Regarding the beard video, I believe the problem could also be related to the skin color, which might not be similar to the dataset used for training DINet (HDTF). Hence, training or fine-tuning on similar data could help.
>
> Furthermore, the first two problems could also be fixed by fine-tuning the model on these specific videos, or by training an image-to-image translation model afterwards to fix these issues. However, that would be very specific to the character in the video and might be a bit challenging.

Hello! Thanks for your detailed response. I think the best way to try to solve this is fine-tuning with these videos. Until I do that, I won't know whether it solves the problem.

Thank you!

oijoijcoiejoijce commented 9 months ago

What image-to-image translation model have you experimented with that was successful? I tried a few super-resolution models but didn't get good results.

davidmartinrius commented 9 months ago

> What image-to-image translation model have you experimented with that was successful? I tried a few super-resolution models but didn't get good results.

Hi @oijoijcoiejoijce, I don't understand; what do you mean? Can you explain it in more detail?

Thanks

Elsaam2y commented 9 months ago

@oijoijcoiejoijce You can try an autoencoder trained on pairs of DINet outputs and the original frames. I haven't developed anything in this direction, but it is what I can think of to try to fix this problem. Super-resolution models won't help much here; on the contrary, they would significantly enhance the artifacts, and the output could look even worse.
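A minimal sketch of that idea in PyTorch, assuming you already have aligned pairs of DINet output crops and ground-truth crops; the architecture and names are illustrative, not something from this repo:

```python
import torch
import torch.nn as nn

class FixupAutoencoder(nn.Module):
    """Small conv autoencoder mapping DINet outputs toward the original frames."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = FixupAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
l1 = nn.L1Loss()

# paired_loader is a placeholder DataLoader yielding (dinet_out, original)
# batches of aligned face crops scaled to [-1, 1].
for dinet_out, original in paired_loader:
    restored = model(dinet_out)
    loss = l1(restored, original)  # L1 keeps the sketch simple
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In practice you would likely want a U-Net with skip connections plus a perceptual or adversarial loss; plain L1 tends to blur exactly the mouth detail you are trying to recover.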