MRzzm / DINet

The source code of "DINet: deformation inpainting network for realistic face visually dubbing on high resolution video."

I did an experiment and did not get very good precision (the face KEPT TALKING when there was SILENCE) #87

Open AIhasArrived opened 5 months ago

AIhasArrived commented 5 months ago

How to fix that?

Inferencer commented 5 months ago

Try adding a constant sound, such as a low-pitch .wav, to the silent parts of the audio file; that should fix this. Alternatively, choose a video file where the person is not talking, or at least not talking during the parts you want to be silent.

AIhasArrived commented 5 months ago

Hello @Inferencer, thanks for the answer. But I don't have a choice: there are specific videos I want to "affect" by changing their lips, so I can't really pick another video. I don't know if I can change the audio to have a constant pitch; I'm not sure how to handle this or whether it would be any use. Your name, I think I read it elsewhere, but I don't know where.

I have another question, @Inferencer, please: are there any recommendations for the CSV file? I only did one experiment, obtaining one CSV file and then using it to change the voice, but I did NOT play with any parameters. Would you know how?

Another important question, @Inferencer: would you know how to make "video retalker" way faster? (It's more precise than DINet, but SO SLOW.) Do you know any method, like using more GPU or CPU, etc.? I just need it to be faster, please! (I am talking about video retalker in this last question.)

Inferencer commented 5 months ago

The first thing to try is cleaning up the audio you are using with a de-noise tool; most video applications have one, or you can use a free online tool.

The second thing to try is something I mentioned earlier but that was misunderstood. Imagine you have a long continuous beep: the speech recognition module will recognize it and keep the mouth still. The higher the pitch of the sound, the more open the mouth will be; the lower the pitch, the more closed. My idea was for you to edit the audio so you have the normal talking, then just add this beep over the silent parts. You don't need to change the pitch of the original audio; just add the sound.
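
If you'd rather script that than use an audio editor, here is a minimal sketch using pydub; this isn't part of DINet, and the file names and the 4.0 to 6.5 second silent span are made-up values for illustration:

from pydub import AudioSegment
from pydub.generators import Sine

# Load the driving audio (hypothetical file name).
audio = AudioSegment.from_wav("driving_audio.wav")

# A quiet 100 Hz tone: low pitch so the mouth stays mostly closed.
# 2500 ms long to cover a silent span running from 4000 ms to 6500 ms.
beep = Sine(100).to_audio_segment(duration=2500).apply_gain(-20)

# Overlay the tone onto the silent region only, then export.
patched = audio.overlay(beep, position=4000)
patched.export("driving_audio_beeped.wav", format="wav")

The overlay mixes the tone on top of the original track at that position, so the normal speech elsewhere is untouched.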

You mention you haven't got a CSV working yet, so I'm not sure how you know whether the video is talking over silent parts yet. The correct settings are on the repo's main page. If you have more issues with that, you can add me on Discord and we can look at the issue together; username: Inferencer.
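
(For reference on generating that CSV: it comes from OpenFace's FeatureExtraction tool, so the command would be something like FeatureExtraction -f ./asserts/examples/ken69.mp4 -out_dir ./asserts/examples -2Dfp; the flag names here are from memory of OpenFace's CLI, so double-check them against the OpenFace wiki and the settings on the repo's main page.)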

I haven't tried retalker in a very long time. I remember how slow it was, and I remember people complaining that CUDA didn't speed it up, but it might be worth looking at forks to see if somebody has fixed that. Although I must say the quality of retalking is poor, and it works best only when the person in the video is not talking.

AIhasArrived commented 5 months ago

The output quality might be poor, but the lips are phenomenally precise! The low output quality can easily be upscaled or fixed afterwards, so that's not a problem; I am still interested in video retalking, as it had the best lip sync of them all. I looked at forks and saw nothing; maybe I didn't do the right search, so if you see anything, tell me. As for the CSV, I made it, but I am not sure if I need to modify things in it to make the final DINet result better, or if I can just make the CSV as I did without thinking much. Thanks so much for your suggestions and your Discord. If I ever contact you there, I will make sure to send you a message here beforehand (otherwise it would not be me).

The idea with the beep is quite interesting. Do you mean the beep should be added over every millisecond (silent or not)? Would you have a particular sound to suggest for this mission?

Thanks!

Inferencer commented 5 months ago

No, put the beep just on the silent parts. What it's doing is telling the speech recognition model that there is a constant sound, since it does not know what to do with silence.

I haven't tried editing the CSV and wouldn't recommend it.

You can actually do other things to improve quality. For instance, DINet chooses 5 random reference frames to map the mouth, so the results you see come from whichever reference frames it has picked from your video. I changed the code to let me choose which frames are selected, and I make sure the ones I pick are clear and not blurry.

You can also do some other things so the face mask isn't jittering so much. In one case I detected the average frame crop radius by printing the value on each frame; in my case it was 124. I then set the crop radius to a fixed 124 and the mask no longer zoomed in and out a lot. It still moved up and down, but if it didn't, it wouldn't work well with a moving face.

AIhasArrived commented 5 months ago

Inferencer commented 5 months ago

I'll upload everything to Google Drive for you. It includes a beep sound you can test on its own; if you need to make it longer, just loop it in a video/audio editing app. I put a silent audio file in there too, just in case there was an issue with your original audio; you should test that as well. https://drive.google.com/drive/folders/1q-iuU39oB3drzPg897wJOgg2kn7Dna1d?usp=sharing

The 5 reference frames are chosen in inference.py. I have uploaded an edited file called inference-frames.py; on line 86 you will see ref_index_list = [39, 105, 148, 165, 170]. Just change those numbers to whatever frames you want to use. But be warned, currently it might say the ref index list is out of range: if you use an audio file shorter than your video, the frames it can choose from come only from the shortened video, so it's best to test this with an audio file that's longer than your video, or to choose only frames that fall within the length of your audio. You can run this script at inference by changing the .py name at the start, for example: python inference-frames.py --mouth_region_size=256 --source_video_path=./asserts/examples/ken69.mp4 --source_openface_landmark_path=./asserts/examples/ken69.csv --driving_audio_path=./asserts/examples/ken69.wav --pretrained_clip_DINet_path=./asserts/clip_training_DINet_256mouth.pth
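
In other words, the only change in inference-frames.py is that one assignment. A hypothetical pick (frame numbers invented for illustration; choose clear, non-blurry frames that fall within the usable range discussed above) would look like:

# inference-frames.py, around line 86: hand-picked reference frames
# instead of the 5 random ones stock inference.py uses.
ref_index_list = [12, 48, 97, 141, 188]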

And yes, you are correct about the crop radius: we find the crop radius it is using on a particular face. It normally uses one radius for, say, 10 frames, then changes it over the next 20 frames if there is movement, so we figure out the average radius it is using. I will have to re-test the code I have for that, as it's in a different inference.py I have edited, to make sure it still works before I share that one.

AIhasArrived commented 5 months ago

Sometimes it just amazes me how awesome people are.

I am currently working on finishing a little project. I was doing DINet and others in hardcore mode (I mean I was doing only that), and I finished/stopped by posting this "issue"/post. I will get back to it after I have tested your method :), just as soon as I finish the little project. Thanks again. Tell me more when you have figured out the final crop radius code, if you wish. Thanks.

Now that I think of it, I remember putting the checkpoints in the folder (unless I am confusing this with another repository, or with OpenFace), and I remember having the code working without one of the checkpoints. I added them one by one until I had no errors, and I still had one checkpoint left; I added it anyway, but I wonder if it had to be copied into another folder.

Inferencer commented 5 months ago

Uploaded the crop radius stuff. Same as last time: to use these files we just change the name of inference.py to the file name you wish to run and leave everything else as normal.

The first file is called print_radius.py, so the command will be python print_radius.py --mouth_region_size=256 --source_video_path=./asserts etc. etc. You don't need to edit it; it just runs normal inference, but it also tells you what crop radius is being used on each frame, collects those crop radius numbers, and calculates the average rounded to the nearest whole number. So you should see something like this near the end of your command prompt output (you will need to scroll up a bit at the end to see the final result):

Crop Radius for Frame 454: 92
Average Crop Radius for all frames: 98

In this case the number you want is 98.
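
The averaging itself is nothing exotic. This is not the uploaded print_radius.py, just a minimal runnable sketch of the same idea, with a made-up radius heuristic (half the vertical extent of the landmarks) standing in for however the real inference code derives the per-frame value:

import numpy as np

def crop_radius_for_frame(landmarks):
    # Hypothetical heuristic: half the vertical extent of the 68 face landmarks.
    ys = landmarks[:, 1]
    return int((ys.max() - ys.min()) / 2)

def print_average_radius(all_landmarks):
    radii = []
    for i, lm in enumerate(all_landmarks):
        r = crop_radius_for_frame(lm)
        radii.append(r)
        print(f"Crop Radius for Frame {i}: {r}")
    average = round(sum(radii) / len(radii))
    print(f"Average Crop Radius for all frames: {average}")
    return average

# Demo with dummy landmark data: 3 frames of 68 (x, y) points each.
rng = np.random.default_rng(0)
print_average_radius([rng.uniform(0, 200, size=(68, 2)) for _ in range(3)])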

The second file is called change_crop_radius.py, so the command will be python change_crop_radius.py --mouth_region_size=256 --source_video_path=./asserts etc. etc. It slows down a bit at the start sometimes, but it's not frozen, so don't worry; sometimes it does some backend calculations and then spits everything out really fast.

We then edit line 127, fixed_crop_radius = 136 # Set your desired fixed crop radius here, and change the value 136 to whatever is best for your face (the number we got earlier, so in this case I would change it to 98).
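
So after running print_radius.py, the edited line in change_crop_radius.py would simply read:

fixed_crop_radius = 98  # Set your desired fixed crop radius here (was 136)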

I could be wrong and this code might still not function as expected, as I just rewrote it and only did a quick test; try it with a face that hardly moves and see how you get on. In regards to the checkpoints, some of those are just used for training purposes.

AIhasArrived commented 5 months ago

This is amazing stuff. As I said, as soon as I finish my project I will go back to this, tell you how it worked for me, and tell you if I have a question (this week or weekend). I have a quick question: do you know much about the "training" section in the readme? I keep seeing these sections in every repo and I never understand their use. Are we supposed to... retrain... with our own data to obtain "maybe better" results? Maybe obtain new "checkpoints" that we like better (provided we have some hardware)? I am never sure about that section. I remember reading another repo where you are just supposed to train and that's it (nothing about using already existing checkpoints). I find all of this fascinating.

Inferencer commented 5 months ago

I haven't seen anybody find a solution for continuing to train existing checkpoints. I tried training a person-specific dataset a while back, but without charts etc. it's hard to know when it's all done. When somebody releases training-related code that would help with that, I'll let you know; I know a few people attempting it and failing, while the people succeeding are keeping their training code private, as they are doing it for commercial projects.

Inferencer commented 4 months ago

I forgot to mention: when picking custom frames you might need to subtract or add 5 to each number; I can't remember if it's minus or plus, so if the frame you want is 20, you pick 15 or 25. I'm pretty sure it's minus though, so in the example I gave you, ref_index_list = [39, 105, 148, 165, 170] would become ref_index_list = [34, 100, 143, 160, 165].

AIhasArrived commented 4 months ago

Hello @Inferencer, I just finished my little project. I thought I was going to spend 1 more day on it; I spent 8 more days! Today I can experiment with the beep. Sorry for taking this much time. OK, I can't wait; I will tell you how it goes in the next 24h.

So, to test it out, I will have to go back to DINet and try your video directly (you said you put one that was silent, right? I assume it has a person moving his lips but no sound, right?). In that case, I need to run that video through OpenFace to obtain the CSV, then open the video in an editing app to add the beep where there is silence, then go back to DINet and launch the inference, and IT SHOULD give me a result where the lips do NOT MOVE during the beep. Did I get that right?

In addition to that, you gave me 2 scripts with changes: one that lets it take different reference frames than the usual defaults, where you need to make sure the chosen frame indices fall within the length of the audio, I think; and another about the lip crop radius, I think.

I am just getting back to this after 8 or more days! I will monitor this conversation during the day. See you.

Inferencer commented 4 months ago

I have been experimenting a lot over the past few days, and it seems DINet does detect silence, so I am not sure why yours is still talking. Perhaps you could share the video and audio with me so I can test.