YoungSeng / DiffuseStyleGesture

DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models (IJCAI 2023) | The DiffuseStyleGesture+ entry to the GENEA Challenge 2023 (ICMI 2023, Reproducibility Award)
MIT License

About my own audio and text in DiffuseStyleGesture+ #28

Closed. Jeremy8080 closed this issue 8 months ago.

Jeremy8080 commented 10 months ago

Hi, I ran the DiffuseStyleGesture code and read the corresponding paper. The paper mentions that the text semantics of the model input can influence the generated gestures, for example when saying 'big'. But after modifying the input text, I found that the output gesture did not change at all. What is the reason for this?

YoungSeng commented 10 months ago

Thank you for your interest in this work. I think this is an expected and normal result:

  1. The generated motion is the result of multiple modalities: in addition to text, the model is conditioned on speech, seed gestures, speaker ID (style), and so on, so the text alone has limited influence.
  2. Whether the gestures that accompany the word "big" in the original dataset are distinctive also determines whether the model can learn this association at all.
  3. If both conditions above are met and the effect still cannot be reproduced, perhaps try the text "big big big big big ..." (a small probe for this is sketched below).

But in fact, saying "big" while making a very "big" gesture is only one possibility; human gestures are highly diverse. Just as people who say "one" do not always hold up a finger, this is more of a theoretical analysis. I hope this helps you.
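
A quick way to test point 3 empirically is to hold every other condition fixed (audio, seed gesture, speaker ID, and the diffusion sampling seed) and vary only the transcript, then measure how much the output actually changes. The sketch below is a minimal probe, assuming a hypothetical `generate_gesture(...)` wrapper around the repo's sampling script (the actual entry point, arguments, and output format in DiffuseStyleGesture+ will differ); the metric, mean absolute per-frame joint difference, is just one simple choice.

```python
import numpy as np

# Hypothetical wrapper around the repo's sampling entry point; adapt this
# to the actual sample script and arguments in DiffuseStyleGesture+.
from sampling_wrapper import generate_gesture  # hypothetical module


def text_sensitivity(audio, seed_gesture, speaker_id, text_a, text_b, seed=1234):
    """Generate gestures for two transcripts under otherwise identical
    conditions and return the mean absolute per-frame joint difference.

    Fixing the diffusion sampling seed matters: sampling is stochastic,
    so without it any measured difference could just be sampling noise.
    """
    g_a = generate_gesture(audio, text_a, seed_gesture, speaker_id, seed=seed)
    g_b = generate_gesture(audio, text_b, seed_gesture, speaker_id, seed=seed)
    # Both outputs are assumed to be (frames, joint_dims) rotation arrays.
    return float(np.mean(np.abs(np.asarray(g_a) - np.asarray(g_b))))


# Repeating the target word (point 3 above) amplifies any learned
# text-gesture association, making a non-zero difference easier to detect:
# diff = text_sensitivity(audio, seed_pose, spk_id, "big", "big big big big big")
```

If the difference stays near zero even for heavily repeated words, that would suggest the text branch contributes little for this input, consistent with points 1 and 2 above.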