Rudrabha / Wav2Lip

This repository contains the code for "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", published at ACM Multimedia 2020. For an HD commercial model, please try out Sync Labs:
https://synclabs.so

How to create a real time (live) Wav2Lip implementation? #358

Open · strfic opened this issue 2 years ago

strfic commented 2 years ago

How can we create a real-time Wav2Lip, for example from a WAV file, live mic audio, or TTS? Is this feasible with Wav2Lip? If so, please provide a script. This feature would be very useful: we provide the WAV file and, instead of writing an output video file, the script should display the video playback in real time (live).

110wuqu commented 2 years ago

Have you solved this problem, and can this method be deployed in a real-time scenario?

ghost commented 2 years ago

Same question here. I really want a real-time one.

rizwanishaq commented 2 years ago

@weijiang2009 @strfic @110wuqu It should be possible: 1) take 200 ms of audio and 1 image, 2) pass that input through the model, and you will get an animated image.

agilebean commented 1 year ago

Has anybody had experience making Wav2Lip produce a real-time lip-sync video stream so you can stream it?

@rizwanishaq can you elaborate on what you mean by 200 ms of audio and 1 image? Please show the code to clarify, if possible.

alaamh commented 11 months ago

me too.

iboyles commented 11 months ago

I've been struggling to find an implementation. I trained a RAD-NeRF model that should support real-time inference, but I haven't gotten it that fast, and my model trained a little weird. Other repos like GeneFace and LiveSpeechPortraits also claim similar ability but have been too hard to implement.

agilebean commented 11 months ago

> I've been struggling to find an implementation. I trained a RAD-NeRF model that should support real-time inference, but I haven't gotten it that fast, and my model trained a little weird.

Congratulations, you are one step ahead! Can you share how you trained it?

> Other repos like GeneFace and LiveSpeechPortraits also claim similar ability but have been too hard to implement.

The second one takes 4 minutes for each inference, so it's easily too slow for real time. I wonder if there are any newer alternatives out there.

iboyles commented 11 months ago

Wav2Lip takes maybe 20 seconds for me. I've made MakeItTalk work on Colab, but it took around a minute; it may be much faster on local hardware. RAD-NeRF, MakeItTalk, and Wav2Lip might be the fastest current ones. I think the awesome-talking-heads repo is good to check out. DiffTalk and diffused heads will probably be faster. My training repo is training-RAD-NeRF; I made it a week ago. It's not super clear, but if you search the repo there is a line with TWO_D that is changed, and I wrote some instructions in the comments. https://github.com/gloomiebloomie/training-RAD-NeRF.git

agilebean commented 11 months ago

> Wav2Lip takes maybe 20 seconds for me.

Really? It takes about 5 seconds for one sentence for me, but I made some modifications and run it on an M2 chip.

> I've made MakeItTalk work on Colab, but it took around a minute; it may be much faster on local hardware. RAD-NeRF, MakeItTalk, and Wav2Lip might be the fastest current ones. I think the awesome-talking-heads repo is good to check out. DiffTalk and diffused heads will probably be faster. My training repo is training-RAD-NeRF; I made it a week ago. It's not super clear, but if you search the repo there is a line with TWO_D that is changed, and I wrote some instructions in the comments. https://github.com/gloomiebloomie/training-RAD-NeRF.git

Great, thanks a lot for the tips. I'll check them out and circle back to you.

iboyles commented 11 months ago

Dang, how did you make it that fast? I should just use Wav2Lip; I don't have an M2 chip, but I do have a decent gaming PC. I think MakeItTalk could be shortened to faster times. I recently updated their Colab in a repo to make it run on the Python version currently supported in Colab. https://github.com/iboyles/makeittalknow.git

gloomiebloomie commented 11 months ago

I think you can make this repo real time by using a picture, or a few frames of video (very short, 2-5 seconds at a low fps like 25, so generation takes only a little longer than real time), instead of a full video, by passing it to the --face argument when running inference. Also make sure the video is 512x512; 720x720 is the max resolution for speed, imo. Using just a PNG only moves the mouth, but it ran in real time on my 3070 8 GB GPU.
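
As a rough example, an invocation with a single image could look like this (the file names and checkpoint path are placeholders; --face, --audio, and --outfile are the flags exposed by this repo's inference.py):

```bash
# Sketch: run Wav2Lip inference on a single face image instead of a video.
# face.png and speech.wav are placeholder inputs; adjust paths to your setup.
python inference.py \
  --checkpoint_path checkpoints/wav2lip_gan.pth \
  --face face.png \
  --audio speech.wav \
  --outfile results/realtime_test.mp4
```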

VictorMotogna commented 5 months ago

@agilebean is there any way you could help me with the modifications you did for the Apple ARM? Thank you in advance!

rizwanishaq commented 5 months ago

> Has anybody had experience making Wav2Lip produce a real-time lip-sync video stream so you can stream it?

> @rizwanishaq can you elaborate on what you mean by 200 ms of audio and 1 image? Please show the code to clarify, if possible.

I did this as follows: I take 200 ms of audio samples and 1 image, and that's how I am able to run this in real time. Actually, I am able to run this at 20 fps, which is good enough for me.
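
In case it helps, here is a minimal sketch of that loop. It assumes this repo's models.Wav2Lip class and audio.py helpers, a wav2lip_gan.pth checkpoint, and a face.png that is already a tight face crop; the paths, chunking, and display code are illustrative, not an official API:

```python
# Rough sketch of the "200 ms of audio + 1 image" loop described above.
# Assumes this repo's models.Wav2Lip and audio.py are importable, that
# checkpoints/wav2lip_gan.pth exists, and that face.png is a tight face crop.
import cv2
import numpy as np
import torch

import audio                      # this repo's audio.py (load_wav, melspectrogram)
from models import Wav2Lip        # this repo's generator

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the generator (same checkpoint unwrapping as inference.py).
model = Wav2Lip()
ckpt = torch.load('checkpoints/wav2lip_gan.pth', map_location=device)
weights = {k.replace('module.', ''): v for k, v in ckpt['state_dict'].items()}
model.load_state_dict(weights)
model = model.to(device).eval()

face = cv2.resize(cv2.imread('face.png'), (96, 96))    # single reference frame
wav = audio.load_wav('speech.wav', 16000)              # could instead be a mic buffer
mel = audio.melspectrogram(wav)                         # shape (80, T), ~80 mel frames/sec

fps = 25
mel_window = 16                                         # 16 mel frames ~= 200 ms of audio
step = 80.0 / fps                                       # mel frames to advance per video frame

i = 0
while True:
    start = int(i * step)
    if start + mel_window > mel.shape[1]:
        break
    m = mel[:, start:start + mel_window]

    img = face.copy()
    masked = img.copy()
    masked[96 // 2:] = 0                                # zero the lower half, as inference.py does
    img_in = np.concatenate((masked, img), axis=2)[np.newaxis] / 255.0  # (1, 96, 96, 6)
    mel_in = m[np.newaxis, :, :, np.newaxis]                            # (1, 80, 16, 1)

    img_t = torch.FloatTensor(img_in.transpose(0, 3, 1, 2)).to(device)
    mel_t = torch.FloatTensor(mel_in.transpose(0, 3, 1, 2)).to(device)

    with torch.no_grad():
        pred = model(mel_t, img_t)                      # (1, 3, 96, 96)

    frame = (pred[0].cpu().numpy().transpose(1, 2, 0) * 255).astype(np.uint8)
    cv2.imshow('wav2lip live sketch', frame)
    if cv2.waitKey(1000 // fps) & 0xFF == ord('q'):     # ~25 fps playback
        break
    i += 1

cv2.destroyAllWindows()
```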

agilebean commented 5 months ago

@VictorMotogna I'm sorry, I should have called the modifications an "ugly hack". I sliced the frames and sent them in batches, as I thought batch-wise sending would reduce latency, but in the end there was no time reduction, I think. I recommend using a JPG or PNG image with a reduced file size.

To anyone: doesn't it bother you that Wav2Lip is trained on low-resolution images and thus looks really bad on faces at normal resolution (> 75 dpi)?

AndrewInStage commented 3 months ago

Has anyone gotten this to work? I have a solid PC, but even with a very short audio file and 1 image it didn't seem to add anything to the results folder and was stuck at 0%. I have CUDA 11.6, but it didn't seem to be using it and gave me a warning about not using my GPU. I'm trying to use either Wav2Lip or LiveSpeechPortraits to make a real-time chatbot similar to how Call Annie works, but no luck. Any help would be much appreciated!

gloomiebloomie commented 3 months ago

Use Google Colab to get free GPU usage. I would say using Wav2Lip with a GPU and a 1.3B-2.7B parameter LLM with context history works for this. Use Coqui TTS for text-to-speech and Whisper (medium) for speech recognition.
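
A minimal sketch of that pipeline, assuming Whisper for speech recognition, Coqui TTS for synthesis, a placeholder generate_reply stub standing in for the local LLM, and this repo's inference.py for the lip sync; all model names and paths are illustrative:

```python
# Sketch of the chatbot loop described above: speech in -> text -> LLM reply ->
# TTS audio -> Wav2Lip video. All paths and model names are placeholder assumptions.
import subprocess

import whisper            # pip install openai-whisper
from TTS.api import TTS   # pip install TTS (Coqui)

asr = whisper.load_model("medium")
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")   # any Coqui model works here

def generate_reply(prompt: str) -> str:
    # Placeholder for a local 1.3B-2.7B LLM with conversation history.
    return "Sure, let me help with that."

def respond(mic_wav: str, face_img: str = "face.png") -> None:
    user_text = asr.transcribe(mic_wav)["text"]          # speech -> text
    reply = generate_reply(user_text)                    # text -> reply
    tts.tts_to_file(text=reply, file_path="reply.wav")   # reply -> speech

    # Lip-sync the reply onto the avatar with this repo's CLI.
    subprocess.run([
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
        "--face", face_img,
        "--audio", "reply.wav",
        "--outfile", "results/reply.mp4",
    ], check=True)

respond("mic_input.wav")
```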

AndrewInStage commented 3 months ago

Okay, I will look into that. Thanks! @gloomiebloomie

gloomiebloomie commented 3 months ago

I could also send you a script I made that does this on my local PC with an 8 GB GPU; you just have to change some of the module packages for Coqui TTS to allow the input speaker WAV for the YourTTS model as a non-CLI argument. I can send it later today, as I need to rebuild it and write setup instructions. I have one version that you can type to or speak to with the mic.
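
For reference, the Coqui Python API can take the speaker reference WAV for YourTTS programmatically rather than via the CLI, roughly like this (the reference clip and output paths are placeholders):

```python
# Sketch: voice-cloned TTS with Coqui's YourTTS, passing the speaker reference
# WAV programmatically rather than as a command-line argument.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="Hello, this reply will be lip-synced next.",
    speaker_wav="my_voice_sample.wav",   # placeholder reference recording
    language="en",
    file_path="reply.wav",
)
```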

AndrewInStage commented 3 months ago

That would be amazing! Thanks!

AwokeKnowing commented 2 months ago

@gloomiebloomie can you provide that script? I would truly appreciate it.

gloomiebloomie commented 2 months ago

Alright, I will fork the repo and put the code and instructions there; give me 15 minutes.

gloomiebloomie commented 2 months ago

gloomiebloomie/Wav2Lip_realtime_facetime: I just forked the repo for anyone who wants the faster real-time version on a Windows PC. It can definitely work on Mac too; I just don't know the setup. Let me know on that repo if you get stuck or have any questions. Go to the setup and instructions section for the CLI commands to set it up; the README has some general text at the top explaining what it is, plus video resizing to 512x512.

gloomiebloomie commented 2 months ago

> Okay, I will look into that. Thanks! @gloomiebloomie

Try my repo; I now have a setup anyone can use at gloomiebloomie/Wav2Lip_realtime_facetime on GitHub.

hzx829 commented 2 months ago

I wrote a blog post on a real-time Wav2Lip implementation: https://medium.com/@bigy2020real/a-holiday-experiment-developing-a-real-time-digital-human-interface-for-llms-ff2e7f3ebc8a

agilebean commented 1 month ago

@gloomiebloomie thanks for sharing your GitHub repo, and congrats on making it to a real business website! Can you tell me if the latter has some advantages over the GitHub repo, e.g. higher resolution?

@hzx829 thanks for sharing the Medium article; it looks pretty good. However, I also noticed that it's only a small face, so Wav2Lip works at the low resolution. Have you also trained a model on higher resolution?

gloomiebloomie commented 1 month ago

I will be posting a YouTube video tomorrow on how to set up my repo, but for now higher resolution is not possible for a few reasons: 1) the high-resolution models they have are commercial (even though they claim a pretty low resolution); 2) the videos the initial models (wav2lip.pth / the GAN model) were trained on were 512-720p square videos or a similar resolution, and increasing this will of course lead to a downgrade in performance; 3) even if we could make it higher resolution, you would need a GPU that could run and process it fast, and most people have an 8 GB GPU. To get fast generation you have to keep it low-res with not many frames. If you really want something like this, I would suggest Magnific upscaling or a latent diffusion upscaler: create it at low res, then use a quality upscaler. There are also other ways to get high-resolution talking heads, like using the DeepFaceLive repo. Also, it isn't a business; it's an open-source implementation of multiple ML models that I developed over the past year. But yes, you can also train your own model on a higher-resolution dataset; that would help if you have the time and GPU power to do it.

agilebean commented 1 month ago

@gloomiebloomie thanks for the explanation. It makes sense to avoid higher resolution for better performance.

Sorry, I thought the business website (https://synclabs.so/pricing) was yours. But I think it's great to run open source as a business, as long as one provides a free version.

About the upscaling: it's a good idea, but I can't do that because of the real-time requirement.

What I found most interesting is that you have 720p videos in the training set at 512x512 resolution; that is much higher than I had in my download of Wav2Lip. Was that specific to the GAN model?

gloomiebloomie commented 1 month ago

Yeah, mine is just the GitHub repo, so I'm not hosting an app anyway; I just do this for fun. I don't see any money coming out of it, and I love open-sourcing anything I do so others can use it. But I see why upscaling isn't an option. I use a little less than 512x512 (480x480 or something) in inference; in the CLI arguments you can reduce the resolution of the output. I just recorded the sample of my face at 720p (since the app I used didn't have a 512 option), then I use a script to resize it to 512x512. Sync Labs is the people who made this repo; they say the resolution is something like 200x190 for that, which isn't really HD. I think using 720x720 is probably the best option for you, unless you want to train on a dataset with 1024x1024. There are also some upscalers that work fast (maybe even in real time), ESRGAN or something, I forget what it is called.
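
For what it's worth, the 720p-to-512x512 square resize can be done with a single ffmpeg command along these lines (file names are placeholders, and it assumes the source clip is 1280x720):

```bash
# Center-crop a 1280x720 recording to a 720x720 square, then scale to 512x512 at 25 fps.
ffmpeg -i face_720p.mp4 -vf "crop=720:720,scale=512:512" -r 25 -c:a copy face_512.mp4
```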

agilebean commented 1 month ago

@gloomiebloomie thanks a lot for the details. Can you maybe share your 480x480 model? That seems to be the sweet spot for real time. I think it is still much better than the original model I used, as that is only 96x96.

gloomiebloomie commented 1 month ago

I just use the wav2lip_gan.pth model they provide; I didn't train my own, that's just the output it gives. There is a command-line argument in Wav2Lip that lets you resize relative to the given video size, like this: "--resize_factor, default=1, type=int, help='Reduce the resolution by this factor. Sometimes, best results are obtained at 480p or 720p'". That is from the inference script; I think my inference script has these as well. Here is my YouTube video showing how to set up and install my UI/branch of Wav2Lip: https://youtu.be/j024dXM-8FI?si=0-sjKM9NLfYrEPnH Once you have it installed and working, you can set up your own face in it. Also, just reducing the framerate helps; I think 25 fps is what they used. My clip is 3-5 seconds, 25 fps, 512x512, then output at 480p. You could also reduce the batch sizes to get even faster generation using --wav2lip_batch_size and --face_det_batch_size, but it will lead to some loss. Also remember it's possible to pass just a single frame, so you can send an image to Wav2Lip and make it run faster as well, although there will be some loss there too; just keep adjusting until you find your sweet spot. A sample command with these flags is sketched below.
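
For example (file names are placeholders; the flags themselves come from this repo's inference.py):

```bash
# Sketch: speed-oriented settings: a short 512x512 clip at 25 fps, smaller
# batches, and --resize_factor available to downscale the output further.
python inference.py \
  --checkpoint_path checkpoints/wav2lip_gan.pth \
  --face face_512.mp4 \
  --audio speech.wav \
  --resize_factor 1 \
  --wav2lip_batch_size 64 \
  --face_det_batch_size 8 \
  --outfile results/fast_out.mp4
```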

kulomady commented 1 month ago

> I wrote a blog post on a real-time Wav2Lip implementation: https://medium.com/@bigy2020real/a-holiday-experiment-developing-a-real-time-digital-human-interface-for-llms-ff2e7f3ebc8a

Can you provide the script?

Rithik53 commented 1 week ago

> I wrote a blog post on a real-time Wav2Lip implementation: https://medium.com/@bigy2020real/a-holiday-experiment-developing-a-real-time-digital-human-interface-for-llms-ff2e7f3ebc8a

Please share the script.