KwaiVGI / LivePortrait

Bring portraits to life!
https://liveportrait.github.io

Is there a plan to include Talking Avatar that is Audio Driven in LivePortrait? #35

Open oisilener1982 opened 4 months ago

oisilener1982 commented 4 months ago

Just wondering if there is any hope of using this project to create a talking avatar that is audio-driven. I'm having fun with this project, but it would be nice to have talking heads.

zzzweakman commented 4 months ago

Thank you for your interest! You can check some details about audio-driven control in the supplementary materials of our paper, where we have included the relevant experimental results. @oisilener1982

oisilener1982 commented 4 months ago

Is it available right now, or if not, is there an estimated release date? I'm having fun with LivePortrait; it is so fast, unlike other projects.

oisilener1982 commented 4 months ago

I scanned the PDF paper and I can't find audio-driven control of the face like in SadTalker or Hedra, where we just input an image and audio and generate a talking avatar.

nitinmukesh commented 4 months ago

Please build an audio-driven talking avatar.

Inferencer commented 4 months ago

> I scanned the PDF paper and I can't find audio-driven control of the face like in SadTalker or Hedra, where we just input an image and audio and generate a talking avatar.

Could use an audio-to-3DMM result as the driver, or another lip-sync tool.

zzzweakman commented 4 months ago

> I scanned the PDF paper and I can't find audio-driven control of the face like in SadTalker or Hedra, where we just input an image and audio and generate a talking avatar.

The experiment results can be found in Appendix C of the paper.

zzzweakman commented 4 months ago

Due to some limitations, we are sorry that we are unable to provide this model. But you can follow the description in Appendix C to train an audio-driven model yourself :) @nitinmukesh @oisilener1982

oisilener1982 commented 4 months ago

I am just an ordinary user :( I only learned something by following the YouTube tutorials (newgenai). I might just subscribe to Hedra and combine it with SadTalker, but it would be nice if there were a talking avatar here, because this project is really fast, even faster than SadTalker.

Is it just for now, or is there really no possibility of having a talking head like SadTalker or Hedra?

Or will this be another project?

> C. Audio-driven Portrait Animation
> We can easily extend our video-driven model to audio-driven portrait animation by regressing or generating motions, including expression deformations and head poses, from audio inputs. For instance, we use Whisper [58] to encode audio into sequential features and adopt a transformer-based framework, following FaceFormer [59], to autoregress the motions.
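For readers following along: the appendix passage above only describes the approach. A minimal, illustrative version of such a FaceFormer-style audio-to-motion regressor could look like the sketch below. It is not the authors' released code; the audio encoder, feature sizes, and motion dimensionality are all assumptions.

```python
# Illustrative sketch only (not the authors' code): a FaceFormer-style
# autoregressive transformer that maps audio features (e.g. from Whisper's
# encoder) to per-frame motion: implicit expression deformation + head pose.
# All dimensions below are assumptions.
import torch
import torch.nn as nn

class AudioToMotion(nn.Module):
    def __init__(self, audio_dim=512, motion_dim=21 * 3 + 3,
                 d_model=256, n_heads=4, n_layers=4):
        # motion_dim: e.g. 21 implicit keypoint offsets (x, y, z) plus
        # yaw/pitch/roll -- purely illustrative sizes.
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.motion_proj = nn.Linear(motion_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, motion_dim)

    def forward(self, audio_feats, prev_motions):
        # audio_feats:  (B, T, audio_dim), audio features aligned to video frames
        # prev_motions: (B, T, motion_dim), teacher-forced during training
        memory = self.audio_proj(audio_feats)
        tgt = self.motion_proj(prev_motions)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=tgt.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)   # predicted motion for every frame

# Training would regress these predictions against motions extracted from
# real talking videos (e.g. an L1/L2 loss), i.e. the "train it yourself"
# route suggested above.
```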

Bubarinokk commented 4 months ago

> I scanned the PDF paper and I can't find audio-driven control of the face like in SadTalker or Hedra, where we just input an image and audio and generate a talking avatar.
>
> Could use an audio-to-3DMM result as the driver, or another lip-sync tool.

Where?

Inferencer commented 4 months ago

> Could use an audio-to-3DMM result as the driver, or another lip-sync tool.
>
> Where?

You could use my repo LipSick, or DINet might be better for this, or wait for an expressive 3DMM like Media2Face.

tonyabracadabra commented 4 months ago

Is Media2Face real-time or near real-time like LivePortrait? If so, we can build the pipeline much more easily.

Inferencer commented 4 months ago

> Is Media2Face real-time or near real-time like LivePortrait? If so, we can build the pipeline much more easily.

Can't remember, that paper was a while ago. The issue is getting a good one with a license that fits what you need it for, but generally they are fast.

tonyabracadabra commented 4 months ago

> Can't remember, that paper was a while ago. The issue is getting a good one with a license that fits what you need it for, but generally they are fast.

Was CodeTalker the earlier SOTA for audio-to-3DMM? https://github.com/Doubiiu/CodeTalker We may try that too. Ultimately I'm waiting for something like VASA-1.

Inferencer commented 4 months ago

> Was CodeTalker the earlier SOTA for audio-to-3DMM? https://github.com/Doubiiu/CodeTalker We may try that too. Ultimately I'm waiting for something like VASA-1.

I've kept a distant eye on 3DMMs, watched the project demos, and starred every one I found, but it's only from today that I'm looking at what's available with a good license. I'm still keeping an eye on emotional lip-sync papers to use as drivers, but they just don't seem to have good enough audio-to-lip fidelity. Are you in my Discord inbox, Tony? I see you did the Replicate for LipSick; we might be doing the same thing here, so we should talk just in case. Discord: Inferencer. I have sourced an audio model with multi-language support to drive LivePortrait, but it uses HuBERT, which has a bad license. I don't like DeepSpeech either; it's okay for American male spoken words but not much else.

https://github.com/user-attachments/assets/70f9ff50-8105-4d29-99c7-62b0b31f46af

taichuai commented 3 months ago

Hello @Inferencer, from the audio-driven sample you provided, the result is quite good. May I ask which features of the LivePortrait model you are using as the prediction target for the audio?

torphix commented 3 months ago

@zzzweakman Hi, thanks for the amazing work. For the audio-driven model, are the inputs and targets the expression output of the motion encoder and the yaw, pitch, and roll angles? Do you include the template as in FaceFormer, and if so, do you set the template to the source image, or is the template simply the first value in the sequence of expressions/angles? Are you processing the expression tensors in any way, e.g. scaling them before prediction?

Thank you kindly

nitinmukesh commented 3 months ago

https://github.com/user-attachments/assets/70f9ff50-8105-4d29-99c7-62b0b31f46af

This is really amazing. How did you create this? Please share with us.

taichuai commented 3 months ago

@zzzweakman Hello, I have some questions about the audio-driven setup. Can you please give me some advice? From the paper: "Unlike pose, we cannot explicitly control the expressions, but rather need a combination of these implicit blendshapes to achieve the desired effects."

So, if I want to train an audio-driven model, obviously with audio as the input, what can be used as the target? Directly use the expressions δ from the motion extractor, or the retargeted offset x -> (x = x + Δ)?
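For anyone experimenting with this target question, one plausible starting point (not the authors' confirmed recipe) is to run the motion extractor over every frame of a talking video and record the implicit expression deformation plus head pose per frame. A rough sketch, assuming a `get_kp_info`-style call that returns a dict with `exp`, `pitch`, `yaw`, and `roll` (names should be verified against the repo):

```python
# Rough sketch (assumptions, not the authors' pipeline): build per-frame
# motion targets for an audio-driven model from a real talking video.
import torch

def collect_motion_targets(frames, wrapper):
    """frames: preprocessed face crops (1x3xHxW tensors), one per video frame.
    wrapper: an object exposing a get_kp_info-style motion extractor call
    that returns a dict with 'exp', 'pitch', 'yaw', 'roll' (assumed names)."""
    exps, poses = [], []
    with torch.no_grad():
        for frame in frames:
            kp_info = wrapper.get_kp_info(frame)            # assumed API
            exps.append(kp_info["exp"].flatten())           # implicit expression deformation
            poses.append(torch.cat([kp_info["pitch"].flatten(),
                                    kp_info["yaw"].flatten(),
                                    kp_info["roll"].flatten()]))
    # One motion vector per frame; the paired audio features must be resampled
    # to the same frame rate (e.g. 25 fps) before training.
    return torch.stack(exps), torch.stack(poses)
```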

markson14 commented 3 months ago

> @zzzweakman Hi, thanks for the amazing work. For the audio-driven model, are the inputs and targets the expression output of the motion encoder and the yaw, pitch, and roll angles? Do you include the template as in FaceFormer, and if so, do you set the template to the source image, or is the template simply the first value in the sequence of expressions/angles? Are you processing the expression tensors in any way, e.g. scaling them before prediction?

I've tried to auto-regress exp and the angles (with and without scale and t). It does not generate a normal result; in fact, the driving keypoints are totally distorted. I wonder if I am missing something here.

TZYSJTU commented 2 months ago

> Inferencer_is_amazing.mp4
>
> This is really amazing. How did you create this? Please share with us.

I think he is just using a real video to drive LivePortrait, then merging the audio of the real video into the generated one.
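If that is all that is happening (drive with a real video, then carry over its audio), the audio step is just a remux. A small sketch calling ffmpeg from Python; the paths are placeholders:

```python
# Copy the audio track of the driving video onto the (silent) LivePortrait output.
# Paths are placeholders; ffmpeg must be available on the PATH.
import subprocess

def mux_audio(generated_video, driving_video, out_path):
    subprocess.run([
        "ffmpeg", "-y",
        "-i", generated_video,     # video rendered by LivePortrait
        "-i", driving_video,       # real video that supplied motion + audio
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "copy", "-c:a", "aac",
        "-shortest", out_path,
    ], check=True)

mux_audio("animations/result.mp4", "driving/real_talk.mp4", "result_with_audio.mp4")
```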

ziyaad30 commented 2 months ago

https://github.com/user-attachments/assets/88af8d95-2610-465e-9fff-016a34029d71

I am trying to put together a few models to achieve this; however, the quality is NOT GREAT. This is Wav2Lip after being processed by https://github.com/wangsuzhen/Audio2Head/tree/main

Notice there is no disconnect between the head/neck region and the shoulders, as the video above clearly has, and there is no transparent block around the mouth.

I do not create models (NO GPU POWER), so I throw together a bunch of repos, and I am now trying with LivePortrait.

ziyaad30 commented 2 months ago

https://github.com/user-attachments/assets/b1951c1e-b4b5-4653-915c-1504470cba6c

Update: I am working with this now using LivePortrait, but it is still far from a GOOD result.

TZYSJTU commented 2 months ago

> 481cab92-0a78-4da0-98f6-7e4f6572d597.mp4
>
> I am trying to put together a few models to achieve this; however, the quality is NOT GREAT. This is Wav2Lip after being processed by https://github.com/wangsuzhen/Audio2Head/tree/main Notice there is no disconnect between the head/neck region and the shoulders, as the video above clearly has, and there is no transparent block around the mouth. I do not create models (NO GPU POWER), so I throw together a bunch of repos, and I am now trying with LivePortrait.

So you mean this is the result of another method, not LivePortrait? That is confusing, and it is not suitable to put it under this issue.

ziyaad30 commented 2 months ago

> So you mean this is the result of another method, not LivePortrait? That is confusing, and it is not suitable to put it under this issue.

Yes, and I am now incorporating it with LivePortrait because its quality is really good, like the update I showed above: that one is run through my first (thrown-together) method and then LivePortrait (the quality gain is clearly visible).

ziyaad30 commented 2 months ago

> I think he is just using a real video to drive LivePortrait, then merging the audio of the real video into the generated one.

So this too was irrelevant; you could not figure it out as nitinmukesh did.

ziyaad30 commented 2 months ago

> So you mean this is the result of another method, not LivePortrait? That is confusing, and it is not suitable to put it under this issue.

Audio2Head2Portrait

And there you go, using Audio2Head then LivePortrait:

https://github.com/user-attachments/assets/4abf16a5-cd95-4df0-9fa7-26cf9352a3a0
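For anyone trying to reproduce this two-stage setup, the idea is: generate a rough audio-driven clip from the still image with Audio2Head, use that clip as the driving video for LivePortrait with the same portrait as the source, then mux the speech back in. A heavily hedged sketch follows; the Audio2Head command and output paths are placeholders (check that repo's README), and the LivePortrait flags should likewise be verified against its README.

```python
# Two-stage pipeline sketch: Audio2Head provides audio-driven motion,
# LivePortrait re-renders it at higher quality. Commands/flags/paths are
# illustrative placeholders -- verify against each repo before use.
import subprocess

portrait = "face.jpg"
speech = "speech.wav"

# 1) Audio2Head: audio + still image -> low-res talking-head clip.
#    (Placeholder command; see Audio2Head's README for the real CLI.)
subprocess.run(["python", "audio2head_inference.py",
                "--img_path", portrait, "--audio_path", speech], check=True)
driving_clip = "audio2head_output.mp4"   # wherever Audio2Head writes its result

# 2) LivePortrait: same portrait as the source, the clip as the driving video
#    (-s/-d as in LivePortrait's README; verify locally).
subprocess.run(["python", "inference.py", "-s", portrait, "-d", driving_clip],
               check=True)

# 3) The LivePortrait output is silent, so mux the speech back in with ffmpeg,
#    e.g. using the small mux_audio helper sketched earlier in this thread.
```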

linhcentrio commented 1 month ago

piclumen-1726303886409--pre-video.mp4

Hi @ziyaad30, can you show me how to combine Audio2Head with LivePortrait? How can I make a video like yours?