anothermartz / Easy-Wav2Lip

Colab for making Wav2Lip high quality and easy to use

Is it possible to generate faster? #47

Open gt2ming opened 7 months ago

gt2ming commented 7 months ago

Dear author,

This is a really nice project; it generates much faster than Wav2Lip.

Generating a 30-second video now takes only about 30 seconds.

My machine configuration: two RTX 3090 graphics cards.

Very nice, thank you very much.

I have a couple of questions I need to ask:

  1. What method do you use to improve the generation speed of easy-wav2lip?

  2. Is it possible to make easy-wav2lip generate even faster? I want to build a real-time digital human, so I need a faster rendering speed.

Looking forward to your reply.

anothermartz commented 7 months ago

Credit for the speedup (and for reducing some glitches!) goes to this project:

https://github.com/devxpy/cog-Wav2Lip

They used MediaPipe instead of dlib to track the face, which is considerably faster and more accurate.
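
As a rough illustration of that technique (not the exact code from either repo), here is a minimal sketch of per-frame face detection with MediaPipe; the input filename and confidence threshold are assumptions:

```python
import cv2
import mediapipe as mp

# Minimal sketch: detect a face box per frame with MediaPipe Face Detection.
# This mirrors the general idea (MediaPipe instead of dlib), not the exact
# cog-Wav2Lip / Easy-Wav2Lip implementation.
mp_face = mp.solutions.face_detection

cap = cv2.VideoCapture("input.mp4")  # hypothetical input video
with mp_face.FaceDetection(model_selection=0, min_detection_confidence=0.5) as detector:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV frames are BGR
        results = detector.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.detections:
            h, w = frame.shape[:2]
            box = results.detections[0].location_data.relative_bounding_box
            x, y = int(box.xmin * w), int(box.ymin * h)
            bw, bh = int(box.width * w), int(box.height * h)
            # (x, y, bw, bh) is the face crop that would be passed on to Wav2Lip
cap.release()
```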

As for real-time, the code would need to be changed from taking a completed video file and a completed audio file to accepting a stream instead. Video-wise that should be easy, but for audio I'm not sure it can be done: currently the audio is converted to a mel spectrogram and then passed to Wav2Lip, and that goes beyond my knowledge. I can certainly look into it, though!
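
For reference, converting audio to a mel spectrogram is itself cheap and can be done per chunk; the hard part is restructuring the rest of the pipeline around a stream. Here is a minimal sketch with librosa (the parameters below are illustrative assumptions, not necessarily Wav2Lip's exact settings):

```python
import numpy as np
import librosa

def chunk_to_mel(wav_chunk, sr=16000):
    # Mel spectrogram of one audio chunk; Wav2Lip's audio.py uses its own
    # normalisation and hyperparameters, so treat these values as placeholders.
    mel = librosa.feature.melspectrogram(
        y=wav_chunk, sr=sr, n_fft=800, hop_length=200, win_length=800, n_mels=80
    )
    return librosa.power_to_db(mel)

# One possible streaming idea: keep a rolling buffer of recent samples and
# recompute the mel window that each new video frame needs.
buffer = np.zeros(0, dtype=np.float32)

def on_audio_chunk(chunk, sr=16000):
    global buffer
    buffer = np.concatenate([buffer, chunk])[-sr * 2:]  # keep ~2 s of context
    return chunk_to_mel(buffer, sr)
```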

I've just realised that I think I can speed up the "Improved" quality setting, because it still uses dlib for tracking the mouth. I could also look into upscalers that may be faster than GFPGAN to make a faster "Enhanced" quality.

gt2ming commented 7 months ago

Hi anothermartz, thank you for your reply.

Actually, I've already implemented a fake real-time pipeline, but it doesn't work very well.

First, I capture audio from a microphone and convert it to text.

Next, I get the answer from an LLM, just like ChatGPT.

Next, I convert the answer to audio.

Finally, I use easy-wav2lip to generate the video, and use 'python-hls' to push the video to Chrome.
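
For context, here is a rough sketch of that pipeline; every helper name (transcribe, ask_llm, synthesize, run_easy_wav2lip, push_hls) is a hypothetical placeholder, not a real API:

```python
def answer_one_utterance(mic_wav_path, avatar_video_path):
    question = transcribe(mic_wav_path)        # speech-to-text on the mic audio
    answer = ask_llm(question)                 # ChatGPT-style LLM answer
    answer_wav = synthesize(answer)            # text-to-speech
    # Lip-sync generation is the slow step in this chain
    out_mp4 = run_easy_wav2lip(avatar_video_path, answer_wav)
    push_hls(out_mp4)                          # push the result to the browser via HLS
```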

Generating the video takes a lot of time, which makes me very sad......hahahaha

Look forward to your improvement, thanks.

anothermartz commented 7 months ago

Why are you converting voice to text then back to voice?

Edit: I realised you meant that the text-to-speech is the answer and your voice is the question!

Echolink50 commented 7 months ago

The project is already very fast for the mouth swap. A speed-up for the upscaler/face restoration would be much more interesting, since that step takes a very long time. From my use with SD, I would say GFPGAN gives a more consistent result than something like CodeFormer when it comes to multiple frames. I'm not sure which upscaler you are using, but for SD, Lanczos was the fastest. If you limit the restoration to just the mouth region, maybe a faster restoration model would work. Really, the main thing the restoration model needs to fix is the teeth. Thanks for the project and all the hard work.
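
One way to try that idea (restoring only the mouth region) is sketched below; the crop fractions and the `restore` function are assumptions, and a real implementation would blend the seam rather than paste the patch back directly:

```python
import cv2

def restore_mouth_only(frame, face_box, restore):
    # face_box = (x, y, w, h); restore() is any face-restoration model,
    # e.g. GFPGAN or a faster alternative.
    x, y, w, h = face_box
    my, mh = y + int(h * 0.60), int(h * 0.40)   # roughly the lower part of the face
    mouth = frame[my:my + mh, x:x + w]
    restored = cv2.resize(restore(mouth), (w, mh))
    frame[my:my + mh, x:x + w] = restored        # paste back (no feathering here)
    return frame
```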

gt2ming commented 7 months ago

Sorry for the late reply. I had a holiday. hhhhhhh.

I have another question: if I use my own data to train the model, can I get faster video generation?

2547881370 commented 6 months ago


Credit for speedup (and reduces some glitches!) goes to this project:

https://github.com/devxpy/cog-Wav2Lip

They used mediapipe instead of dlib to track the face which is considerably faster and more accurate.

As for real-time, the code would need to be translated from taking a completed video file and completed audio file and accepting a stream instead. Video wise that should be easy, but audio I'm not sure if it can be done as currently it converts it to a melspectrogram and then passes it to Wav2Lip and that goes beyond my knowledge, I can certainly look into it though!

I've just realised that I think I can speed up the "Improved" quality because that still uses dlib for tracking the mouth and I could look into other upscalers that may be faster than GFPGAN to make a faster "Enhanced" quality.

Hello, is there any recent progress on calculating the mel spectrogram?