KwaiVGI / LivePortrait

Bring portraits to life!
https://liveportrait.github.io

Is there a plan to include Talking Avatar that is Audio Driven in LivePortrait? #35

Open oisilener1982 opened 4 months ago

oisilener1982 commented 4 months ago

Just wondering if there is any hope of using this project to create a talking avatar that is audio-driven. I'm having fun with this project, but it would be nice to have talking heads.

zzzweakman commented 4 months ago

Thank you for your interest! You can check some details about audio-driven control in the supplementary materials of our paper, where we have included the relevant experimental results. @oisilener1982

oisilener1982 commented 4 months ago

Is it available right now, or if not, is there an estimated release date? I'm having fun with LivePortrait; it is so fast, unlike other projects.

oisilener1982 commented 4 months ago

I scanned the PDF paper and I can't find audio-driven control of the face like in SadTalker or Hedra, where we just input an image and audio and generate a talking avatar.

nitinmukesh commented 4 months ago

Please build an audio-driven talking avatar.

Inferencer commented 4 months ago

> I scanned the PDF paper and I can't find audio-driven control of the face like in SadTalker or Hedra, where we just input an image and audio and generate a talking avatar.

Could use an audio-to-3DMM result as the driver, or another lip-sync tool.

zzzweakman commented 4 months ago

> I scanned the PDF paper and I can't find audio-driven control of the face like in SadTalker or Hedra, where we just input an image and audio and generate a talking avatar.

The experiment results can be found in Appendix C of the paper.

zzzweakman commented 4 months ago

Due to some limitations, we are sorry that we are unable to provide this model. But you can follow the description in Appendix C to train an audio-driven model yourself :) @nitinmukesh @oisilener1982

oisilener1982 commented 4 months ago

I am just an ordinary user :( I only learned something by following the YouTube tutorials (newgenai). I might just subscribe to Hedra and combine it with SadTalker, but it would be nice if there were a talking avatar here, because this project is really fast, even faster than SadTalker.

Is it just for now, or is there really no possibility of having a talking head like SadTalker or Hedra?

Or will this be another project?

> C. Audio-driven Portrait Animation
> We can easily extend our video-driven model to audio-driven portrait animation by regressing or generating motions, including expression deformations and head poses, from audio inputs. For instance, we use Whisper [58] to encode audio into sequential features and adopt a transformer-based framework, following FaceFormer [59], to autoregress the motions.
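For readers following along: the appendix passage above only describes the approach. A minimal, illustrative version of such a FaceFormer-style audio-to-motion regressor could look like the sketch below. It is not the authors' released code; the audio encoder, feature sizes, and motion dimensionality are all assumptions.

```python
# Illustrative sketch only (not the authors' code): a FaceFormer-style
# autoregressive transformer that maps audio features (e.g. from Whisper's
# encoder) to per-frame motion: implicit expression deformation + head pose.
# All dimensions below are assumptions.
import torch
import torch.nn as nn

class AudioToMotion(nn.Module):
    def __init__(self, audio_dim=512, motion_dim=21 * 3 + 3,
                 d_model=256, n_heads=4, n_layers=4):
        # motion_dim: e.g. 21 implicit keypoint offsets (x, y, z) plus
        # yaw/pitch/roll -- purely illustrative sizes.
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.motion_proj = nn.Linear(motion_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, motion_dim)

    def forward(self, audio_feats, prev_motions):
        # audio_feats:  (B, T, audio_dim), audio features aligned to video frames
        # prev_motions: (B, T, motion_dim), teacher-forced during training
        memory = self.audio_proj(audio_feats)
        tgt = self.motion_proj(prev_motions)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=tgt.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)   # predicted motion for every frame

# Training would regress these predictions against motions extracted from
# real talking videos (e.g. an L1/L2 loss), i.e. the "train it yourself"
# route suggested above.
```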

Bubarinokk commented 4 months ago

> I scanned the PDF paper and I can't find audio-driven control of the face like in SadTalker or Hedra, where we just input an image and audio and generate a talking avatar.
>
> Could use an audio-to-3DMM result as the driver, or another lip-sync tool.

Where?

Inferencer commented 4 months ago

> Could use an audio-to-3DMM result as the driver, or another lip-sync tool.
>
> Where?

You could use my repo LipSick, or DINet might be better for this, or wait for an expressive 3DMM like Media2Face.

tonyabracadabra commented 4 months ago

Is Media2Face real-time or near real-time like LivePortrait? If so, we can build the pipeline much more easily.

Inferencer commented 4 months ago

> Is Media2Face real-time or near real-time like LivePortrait? If so, we can build the pipeline much more easily.

Can't remember, that paper was a while ago. The issue is getting a good one with a license that fits what you need it for, but generally they are fast.

tonyabracadabra commented 4 months ago

> Can't remember, that paper was a while ago. The issue is getting a good one with a license that fits what you need it for, but generally they are fast.

Was CodeTalker the earlier SOTA for audio-to-3DMM? https://github.com/Doubiiu/CodeTalker We may try that too. Ultimately I'm waiting for something like VASA-1.

Inferencer commented 4 months ago

> Was CodeTalker the earlier SOTA for audio-to-3DMM? https://github.com/Doubiiu/CodeTalker We may try that too. Ultimately I'm waiting for something like VASA-1.

I've kept a distant eye on 3DMMs, watched the project demos, and starred every one I found, but it's only from today that I'm looking at what's available with a good license. I'm still keeping an eye on emotional lip-sync papers to use as drivers, but they just don't seem to have good enough audio-to-lip fidelity. Are you in my Discord inbox, Tony? I see you did the Replicate for LipSick; we might be doing the same thing here, so we should talk just in case. Discord: Inferencer. I have sourced an audio model with multi-language support to drive LivePortrait, but it uses HuBERT, which has a bad license. I don't like DeepSpeech either; it's okay for American male spoken words but not much else.

https://github.com/user-attachments/assets/70f9ff50-8105-4d29-99c7-62b0b31f46af

taichuai commented 3 months ago

Hello @Inferencer, from the audio-driven sample you provided, the result is quite good. May I ask which features of the LivePortrait model you are using as the prediction target for the audio?

torphix commented 3 months ago

@zzzweakman Hi, thanks for the amazing work. For the audio-driven model, are the inputs and targets the expression output of the motion encoder and the yaw, pitch, and roll angles? Do you include the template as in FaceFormer, and if so, do you set the template to the source image, or is the template simply the first value in the sequence of expressions/angles? Are you processing the expression tensors in any way, e.g. scaling them before prediction?

Thank you kindly

nitinmukesh commented 3 months ago

https://github.com/user-attachments/assets/70f9ff50-8105-4d29-99c7-62b0b31f46af

This is really amazing. How did you create this? Please share with us.

taichuai commented 3 months ago

@zzzweakman Hello, I have some questions about the audio-driven setup. Can you please give me some advice? From the paper: "Unlike pose, we cannot explicitly control the expressions, but rather need a combination of these implicit blendshapes to achieve the desired effects."

So, if I want to train an audio-driven model, obviously with audio as the input, what can be used as the target? Directly use the expressions δ from the motion extractor, or the retargeted offset x -> (x = x + Δ)?
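For anyone experimenting with this target question, one plausible starting point (not the authors' confirmed recipe) is to run the motion extractor over every frame of a talking video and record the implicit expression deformation plus head pose per frame. A rough sketch, assuming a `get_kp_info`-style call that returns a dict with `exp`, `pitch`, `yaw`, and `roll` (names should be verified against the repo):

```python
# Rough sketch (assumptions, not the authors' pipeline): build per-frame
# motion targets for an audio-driven model from a real talking video.
import torch

def collect_motion_targets(frames, wrapper):
    """frames: preprocessed face crops (1x3xHxW tensors), one per video frame.
    wrapper: an object exposing a get_kp_info-style motion extractor call
    that returns a dict with 'exp', 'pitch', 'yaw', 'roll' (assumed names)."""
    exps, poses = [], []
    with torch.no_grad():
        for frame in frames:
            kp_info = wrapper.get_kp_info(frame)            # assumed API
            exps.append(kp_info["exp"].flatten())           # implicit expression deformation
            poses.append(torch.cat([kp_info["pitch"].flatten(),
                                    kp_info["yaw"].flatten(),
                                    kp_info["roll"].flatten()]))
    # One motion vector per frame; the paired audio features must be resampled
    # to the same frame rate (e.g. 25 fps) before training.
    return torch.stack(exps), torch.stack(poses)
```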

markson14 commented 3 months ago

> @zzzweakman Hi, thanks for the amazing work. For the audio-driven model, are the inputs and targets the expression output of the motion encoder and the yaw, pitch, and roll angles? Do you include the template as in FaceFormer, and if so, do you set the template to the source image, or is the template simply the first value in the sequence of expressions/angles? Are you processing the expression tensors in any way, e.g. scaling them before prediction?

I've tried to auto-regress exp and the angles (with and without scale and t). It does not generate a normal result; in fact, the driving keypoints are totally distorted. I wonder if I am missing something here.

TZYSJTU commented 2 months ago

> Inferencer_is_amazing.mp4
>
> This is really amazing. How did you create this? Please share with us.

I think he is just using a real video to drive LivePortrait, then merging the audio of the real video into the generated one.
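If that is all that is happening (drive with a real video, then carry over its audio), the audio step is just a remux. A small sketch calling ffmpeg from Python; the paths are placeholders:

```python
# Copy the audio track of the driving video onto the (silent) LivePortrait output.
# Paths are placeholders; ffmpeg must be available on the PATH.
import subprocess

def mux_audio(generated_video, driving_video, out_path):
    subprocess.run([
        "ffmpeg", "-y",
        "-i", generated_video,     # video rendered by LivePortrait
        "-i", driving_video,       # real video that supplied motion + audio
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "copy", "-c:a", "aac",
        "-shortest", out_path,
    ], check=True)

mux_audio("animations/result.mp4", "driving/real_talk.mp4", "result_with_audio.mp4")
```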

ziyaad30 commented 2 months ago

https://github.com/user-attachments/assets/88af8d95-2610-465e-9fff-016a34029d71

I am trying to put together a few models to achieve this; however, the quality is NOT GREAT. This is Wav2Lip after being processed by https://github.com/wangsuzhen/Audio2Head/tree/main

Notice there is no disconnect between the head/neck region and the shoulders, as the video above clearly has, and there is no transparent block around the mouth.

I do not create models (NO GPU POWER), so I throw together a bunch of repos, and I am now trying with LivePortrait.

ziyaad30 commented 2 months ago

https://github.com/user-attachments/assets/b1951c1e-b4b5-4653-915c-1504470cba6c

Update: I am working with this now using LivePortrait, but it is still far from a GOOD result.

TZYSJTU commented 2 months ago

> 481cab92-0a78-4da0-98f6-7e4f6572d597.mp4
>
> I am trying to put together a few models to achieve this; however, the quality is NOT GREAT. This is Wav2Lip after being processed by https://github.com/wangsuzhen/Audio2Head/tree/main Notice there is no disconnect between the head/neck region and the shoulders, as the video above clearly has, and there is no transparent block around the mouth. I do not create models (NO GPU POWER), so I throw together a bunch of repos, and I am now trying with LivePortrait.

So you mean this is the result of another method, not LivePortrait? That is confusing, and it is not suitable to put it under this issue.

ziyaad30 commented 2 months ago

> So you mean this is the result of another method, not LivePortrait? That is confusing, and it is not suitable to put it under this issue.

Yes, and I am now incorporating it with LivePortrait because its quality is really good, like the update I showed above: that one is run through my first (thrown-together) method and then LivePortrait (the quality gain is clearly visible).

ziyaad30 commented 2 months ago

> I think he is just using a real video to drive LivePortrait, then merging the audio of the real video into the generated one.

So this too was irrelevant; you could not figure it out as nitinmukesh did.

ziyaad30 commented 2 months ago

> So you mean this is the result of another method, not LivePortrait? That is confusing, and it is not suitable to put it under this issue.

Audio2Head2Portrait

And there you go, using Audio2Head then LivePortrait:

https://github.com/user-attachments/assets/4abf16a5-cd95-4df0-9fa7-26cf9352a3a0
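For anyone trying to reproduce this two-stage setup, the idea is: generate a rough audio-driven clip from the still image with Audio2Head, use that clip as the driving video for LivePortrait with the same portrait as the source, then mux the speech back in. A heavily hedged sketch follows; the Audio2Head command and output paths are placeholders (check that repo's README), and the LivePortrait flags should likewise be verified against its README.

```python
# Two-stage pipeline sketch: Audio2Head provides audio-driven motion,
# LivePortrait re-renders it at higher quality. Commands/flags/paths are
# illustrative placeholders -- verify against each repo before use.
import subprocess

portrait = "face.jpg"
speech = "speech.wav"

# 1) Audio2Head: audio + still image -> low-res talking-head clip.
#    (Placeholder command; see Audio2Head's README for the real CLI.)
subprocess.run(["python", "audio2head_inference.py",
                "--img_path", portrait, "--audio_path", speech], check=True)
driving_clip = "audio2head_output.mp4"   # wherever Audio2Head writes its result

# 2) LivePortrait: same portrait as the source, the clip as the driving video
#    (-s/-d as in LivePortrait's README; verify locally).
subprocess.run(["python", "inference.py", "-s", portrait, "-d", driving_clip],
               check=True)

# 3) The LivePortrait output is silent, so mux the speech back in with ffmpeg,
#    e.g. using the small mux_audio helper sketched earlier in this thread.
```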

linhcentrio commented 1 month ago

piclumen-1726303886409--pre-video.mp4

Hi @ziyaad30, can you show me how to combine Audio2Head with LivePortrait? How can I make a video like yours?