strfic opened this issue 2 years ago
Have you solved this problem? Can this method be deployed in a real-time scenario?
Same question here. I'd really like a real-time version.
@weijiang2009 @strfic @110wuqu It could be possible: 1) take 200 ms of audio and 1 image, 2) pass that input through the model, and you will get an animated image.
Has anybody got experience making Wav2Lip produce a real-time lip-sync video stream, so you can stream it?
@rizwanishaq can you elaborate on what you mean by 200 ms of audio and 1 image? Please share the code to clarify, if possible.
Me too.
I've been struggling to find an implementation. I trained a RAD-NeRF model that should be capable of real-time inference, but I haven't gotten it that fast, and my model trained a little weirdly. Other repos like GeneFace and LiveSpeechPortraits also claim similar ability but have been too hard to implement.
I've been struggling to find an implementation. I trained a RAD-NeRF model that should be capable of real-time inference, but I haven't gotten it that fast, and my model trained a little weirdly.
Congratulations, you are one step ahead! Can you share how you trained it?
Other repos like GeneFace and LiveSpeechPortraits also claim similar ability but have been too hard to implement.
The second one takes 4 minutes for each inference, so it's far too slow for real time. I wonder if there are any newer alternatives out there.
Wav2Lip takes maybe 20 seconds for me. I've made MakeItTalk work on Colab, but it took around one minute; it might be way faster on local hardware. RAD-NeRF/MakeItTalk/Wav2Lip might be the fastest current ones. I think the awesome-talking-heads repo is worth checking out. DiffTalk and diffusion-based heads will probably be faster, maybe. My training repo is training-RAD-NeRF; I made it a week ago. It's not super clear, but if you search the repo there is a line with TWO_D that gets changed, and I wrote some instructions in the comments. https://github.com/gloomiebloomie/training-RAD-NeRF.git
Wav2Lip takes maybe 20 seconds for me.
Really? It takes about 5 s per sentence for me, but I made some modifications and run it on an M2 chip.
I've made MakeItTalk work on Colab, but it took around one minute; it might be way faster on local hardware. RAD-NeRF/MakeItTalk/Wav2Lip might be the fastest current ones. I think the awesome-talking-heads repo is worth checking out. DiffTalk and diffusion-based heads will probably be faster, maybe. My training repo is training-RAD-NeRF; I made it a week ago. It's not super clear, but if you search the repo there is a line with TWO_D that gets changed, and I wrote some instructions in the comments. https://github.com/gloomiebloomie/training-RAD-NeRF.git
Great, thanks a lot for the tips. I'll check them out and circle back to you.
Dang, how did you make it that fast? I should just use Wav2Lip. I don't have an M2 chip, but I do have a decent gaming PC. I think MakeItTalk could be sped up too; I recently updated their Colab in a repo to make it run on the Python version currently supported in Colab. https://github.com/iboyles/makeittalknow.git
I think you can make this repo real-time by using a picture, or a few frames of video (very short, 2-5 seconds at a low FPS like 25, which is a little longer than real time), instead of a full video, by passing it to the --face argument when running the inference. Also make sure the video is 512x512; 720x720 is the max resolution for speed, in my opinion. With just a PNG it only moves the mouth, but it ran in real time on my 3070 8 GB GPU.
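For illustration, here is a minimal sketch of that single-image setup. It assumes the stock `inference.py` from this repo; the file names and checkpoint path are placeholders, not a tested configuration.

```python
# Drive the stock Wav2Lip inference script with a single still image instead of
# a video. When --face is an image, the script treats it as a static input.
import subprocess

subprocess.run([
    "python", "inference.py",
    "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pretrained GAN checkpoint
    "--face", "speaker.png",           # one 512x512 (or up to 720x720) face image
    "--audio", "reply.wav",            # short audio clip, e.g. one TTS sentence
    "--resize_factor", "1",            # raise to 2 if generation is too slow
    "--outfile", "results/out.mp4",
], check=True)
```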
@agilebean is there any way you could help me with the modifications you made for Apple ARM? Thank you in advance!
Has anybody got experience making Wav2Lip produce a real-time lip-sync video stream, so you can stream it?
@rizwanishaq can you elaborate on what you mean by 200 ms of audio and 1 image? Please share the code to clarify, if possible.
I did it like this: I take 200 ms of audio samples and 1 image; that's how I am able to run this in real time. Actually, I am able to run it at 20 fps, which is good enough for me.
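To make the idea concrete, here is a rough sketch (not @rizwanishaq's actual code) of feeding ~200 ms audio chunks plus one pre-cropped face into this repo's model. The helper name `animate_chunk` and all file paths are made up; face detection, overlapping mel windows, and audio playback are omitted.

```python
# Minimal "200 ms of audio + one image" loop, reusing this repo's audio.py and
# models.Wav2Lip. Assumes a face image already cropped and resized to 96x96.
import cv2
import numpy as np
import torch
from models import Wav2Lip
import audio

device = "cuda" if torch.cuda.is_available() else "cpu"

model = Wav2Lip()
ckpt = torch.load("checkpoints/wav2lip_gan.pth", map_location=device)
model.load_state_dict({k.replace("module.", ""): v
                       for k, v in ckpt["state_dict"].items()})
model = model.to(device).eval()

# One static face, cropped to the face region and resized to 96x96 (BGR).
face = cv2.resize(cv2.imread("face.png"), (96, 96))

def animate_chunk(wav_chunk_16k):
    """Generate one frame for ~200 ms of 16 kHz audio samples (a rough sketch)."""
    mel = audio.melspectrogram(wav_chunk_16k)            # shape (80, T)
    mel_window = torch.FloatTensor(mel[:, :16])          # model expects 16 mel steps per frame
    mel_batch = mel_window.unsqueeze(0).unsqueeze(0).to(device)   # (1, 1, 80, 16)

    img = face.astype(np.float32) / 255.0
    masked = img.copy()
    masked[96 // 2:] = 0                                  # mask lower half, as in inference.py
    img_six = np.concatenate((masked, img), axis=2)       # (96, 96, 6)
    img_batch = torch.FloatTensor(img_six.transpose(2, 0, 1)).unsqueeze(0).to(device)

    with torch.no_grad():
        pred = model(mel_batch, img_batch)                # (1, 3, 96, 96), values in [0, 1]
    return (pred[0].cpu().numpy().transpose(1, 2, 0) * 255).astype(np.uint8)
```

In real use you would overlap the mel windows so you get several frames per 200 ms chunk (e.g. 20-25 fps), but the chunk-in, frame-out structure is the same.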
@VictorMotogna I'm sorry, I should have called the modifications an "ugly hack". I sliced the frames and sent them in batches, thinking batch-wise sending would reduce latency, but in the end I saw no time reduction. I recommend using a JPG or PNG image with a reduced file size.
To anyone: Doesn't it bother you most that Wav2Lip is trained on low-resolution images and thus looks really bad on faces at normal resolution (> 75 dpi)?
Has anyone gotten this to work? I have a solid PC, but even with a very short audio file and 1 image it didn't seem to add anything to the results folder and was stuck at 0%. I have CUDA 11.6, but it didn't seem to be using it and gave me a warning about not using my GPU. I'm trying to use either Wav2Lip or LiveSpeechPortraits to make a real-time chatbot, similar to how Call Annie works, but no luck. Any help would be much appreciated!
Use Google Colab to get free GPU usage. I would say Wav2Lip on a GPU, plus a 1.3B-2.7B-parameter LLM with context history, works for this. Use Coqui TTS for text-to-speech and Whisper medium for speech recognition.
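A hedged sketch of that chatbot loop (speech in, text out, spoken reply, lip-synced video). The model names and file paths are only illustrative, and context history is left out.

```python
# Speech -> text -> LLM reply -> TTS -> Wav2Lip, using openai-whisper,
# Hugging Face transformers, Coqui TTS, and this repo's inference script.
import subprocess
import whisper                      # openai-whisper, speech-to-text
from transformers import pipeline   # small causal LLM for the reply
from TTS.api import TTS             # Coqui TTS for the voice

stt = whisper.load_model("medium")
llm = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
tts = TTS("tts_models/multilingual/multi-dataset/your_tts")

def answer(question_wav: str) -> None:
    # 1. Recognize the user's speech.
    text = stt.transcribe(question_wav)["text"]

    # 2. Generate a short reply (no conversation memory in this sketch).
    reply = llm(f"User: {text}\nAssistant:", max_new_tokens=60,
                return_full_text=False)[0]["generated_text"]

    # 3. Synthesize the reply in a cloned voice (speaker_wav is a placeholder).
    tts.tts_to_file(text=reply, speaker_wav="my_voice.wav",
                    language="en", file_path="reply.wav")

    # 4. Lip-sync a short face clip to that audio with the stock script.
    subprocess.run(["python", "inference.py",
                    "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
                    "--face", "face_512.mp4", "--audio", "reply.wav",
                    "--outfile", "results/reply.mp4"], check=True)
```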
Okay, I will look into that. Thanks! @gloomiebloomie
I could also send you a script I made that does this on my local PC with an 8 GB GPU; you just have to change some of the module packages for Coqui TTS to accept the input speaker wav for the YourTTS model as a non-command-line argument. I can send it later today, as I need to rebuild it and write up setup instructions. But I have one version that you can type to or speak to with the mic.
That would be amazing! Thanks!
@gloomiebloomie can you provide that script? I would truly appreciate it.
Alright, I will fork the repo and put the code/instructions there; give me 15 minutes.
gloomiebloomie/Wav2Lip_realtime_facetime: I just forked the repo for anyone who wants the faster real-time setup on a Windows PC. It can definitely work for Mac too, I just don't know the setup. Let me know on that repo if you get stuck or have any questions. Go to the setup and instructions for the CLI commands to set it up; the README has some general text at the top explaining what it is and covering video resizing to 512x512.
Okay, I will look into that. Thanks! @gloomiebloomie
Try my repo; I have a setup anyone can use now at gloomiebloomie/Wav2Lip_realtime_facetime on GitHub.
I wrote a blog post about a real-time Wav2Lip implementation: https://medium.com/@bigy2020real/a-holiday-experiment-developing-a-real-time-digital-human-interface-for-llms-ff2e7f3ebc8a
@gloomiebloomie thanks for sharing your GitHub repo, and congrats on making it into a real business website! Can you tell me if the latter has some advantages over the GitHub repo, e.g. higher resolution?
@hzx829 thanks for sharing the Medium article. It looks pretty good. However, I also noticed that it shows only a small face, so Wav2Lip works at the low resolution. Have you also trained a model at a higher resolution?
I will be posting a YouTube video on how to set up my repo tomorrow, but for now higher resolution is not possible for a few reasons: 1) the high-resolution models they have are commercial (even though they claim a pretty low resolution); 2) the videos the initial models (wav2lip.pth / the GAN model) were trained on were 512-720p square videos or a similar resolution, and increasing this will of course lead to a downgrade in performance; 3) even if we could make it higher resolution, you would need a GPU that could run and process it fast, and most people have an 8 GB GPU. To get fast generation you have to keep it low-res and not use many frames. If you really want something like this, I would suggest Magnific upscaling or a latent diffusion upscaler: just create it at low res, then use a quality upscaler. There are also other ways to get high-resolution talking heads, like using the DeepFaceLive repo. Also, it isn't a business; it is an open-source implementation of multiple ML models that I developed over the past year. But yes, you can also train your own model on a higher-resolution dataset; that would help if you have the time and GPU power to do it.
@gloomiebloomie thanks for the explanation. It makes sense to avoid higher resolution for better performance.
Sorry, I thought the [business website](https://synclabs.so/pricing) was yours. But I think it's great to build a business on open source, as long as one provides a free version.
About the upscaling: it's a good idea, but I can't do that because of the real-time requirement.
What I found most interesting is that you have 720p videos in the training set at 512x512 resolution; that is much higher than what I had in my download of Wav2Lip. Was that specific to the GAN model?
Yeah, mine is just the GitHub repo, so I'm not hosting an app anywhere; I just do this for fun. I don't see any money coming out of it, and I love open-sourcing anything I do so others can use it. But I see why upscaling isn't an option. I use a little less than 512x512 (480x480 or something) at inference. In the CLI arguments you can reduce the resolution of the output. I just recorded the sample of my face at 720p (since the app I used didn't have a 512 option), then I use a script to resize it to 512x512. Sync Labs are the people that made this repo; they also say the resolution is something like 200x190 for that, which isn't really HD. I think using 720x720 is probably the best option for you, unless you want to train on a dataset with 1024x1024. There are also some upscalers that work fast (maybe even in real time), ESRGAN or something, I forget what it is called.
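For reference, one possible version of that resize step, assuming ffmpeg is on your PATH; the file names are placeholders.

```python
# Preprocess the recorded face clip: squash it to 512x512 at 25 fps, the
# settings discussed above, before passing it to --face.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "face_720.mp4",            # the recorded 720p face clip
    "-vf", "scale=512:512",          # force a 512x512 square output
    "-r", "25",                      # 25 fps, the rate Wav2Lip expects
    "face_512.mp4",
], check=True)
```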
@gloomiebloomie thanks a lot for the details. Could you maybe share your 480x480 model? This seems to be the sweet spot for real time. I think it is still much better than the original model I used, as that is only 96x96.
I just use the wav2lip_gan.pth model they provide; I didn't train my own, that's just the output it gives. There is a command-line argument in Wav2Lip that lets you resize to any size, even for a given video size, like this: "--resize_factor, default=1, type=int, help='Reduce the resolution by this factor. Sometimes, best results are obtained at 480p or 720p'". This is from the inference script, and I think my inference script has these as well. https://youtu.be/j024dXM-8FI?si=0-sjKM9NLfYrEPnH Here is my YouTube video showing how to set up and install my UI/branch of Wav2Lip; once you have it installed and working, you can set up your own face in it. Also, just reducing the frame rate helps; I think 25 fps is what they used. My clip is 3-5 seconds at 25 fps and 512x512, then output at 480p. You could also reduce the batch size to get even faster generation using --wav2lip_batch_size and --face_det_batch_size, but it will lead to some loss. Also remember it's possible to just toss in a single frame, so you can send an image to the Wav2Lip generation and make it run faster as well, although there will be some loss there too; just keep adjusting until you hit your sweet spot.
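As a hedged example of those speed knobs, here is one possible invocation. The values are only starting points (the best settings depend on your GPU), and the checkpoint and file names are placeholders.

```python
# Speed-oriented run of the stock inference script: a single still frame,
# reduced working resolution, and smaller batch sizes for lower latency.
import subprocess

subprocess.run([
    "python", "inference.py",
    "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
    "--face", "face.png",             # a single frame keeps face detection cheap
    "--audio", "reply.wav",
    "--fps", "25",                    # only applies when --face is a still image
    "--resize_factor", "2",           # halve the working resolution for speed
    "--wav2lip_batch_size", "64",     # smaller batches: lower latency, some throughput loss
    "--face_det_batch_size", "8",
    "--outfile", "results/out_fast.mp4",
], check=True)
```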
I wrote a blog post about a real-time Wav2Lip implementation: https://medium.com/@bigy2020real/a-holiday-experiment-developing-a-real-time-digital-human-interface-for-llms-ff2e7f3ebc8a
Can you provide the script?
I wrote a blog post about a real-time Wav2Lip implementation: https://medium.com/@bigy2020real/a-holiday-experiment-developing-a-real-time-digital-human-interface-for-llms-ff2e7f3ebc8a
Please share the script.
How can we create a real-time Wav2Lip, for example from a wav file, live mic audio, or TTS? Is it feasible with Wav2Lip? If yes, please provide the script. This feature could be very useful: we provide the wav file, and instead of producing an output video file, the script should display the video output in real time (live playback).
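One possible sketch of that live-playback idea, reusing the hypothetical `animate_chunk()` helper from the earlier streaming sketch. It reads a wav file in ~200 ms chunks and shows each generated frame with OpenCV; audio playback and mic capture are omitted, and the file name is a placeholder.

```python
# Display generated frames live instead of writing a video file.
import cv2
import audio   # this repo's audio.py

wav = audio.load_wav("speech.wav", 16000)      # 16 kHz samples as a numpy array
chunk = int(0.2 * 16000)                       # ~200 ms of samples per step

for start in range(0, len(wav) - chunk, chunk):
    frame = animate_chunk(wav[start:start + chunk])   # hypothetical helper from above
    cv2.imshow("Wav2Lip live", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):      # press q to stop
        break
cv2.destroyAllWindows()
```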