YoungSeng / DiffuseStyleGesture

DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models (IJCAI 2023) | The DiffuseStyleGesture+ entry to the GENEA Challenge 2023 (ICMI 2023, Reproducibility Award)
MIT License

About my own audio and text in DiffuseStyleGesture+ #28

Closed. Jeremy8080 closed this issue 8 months ago.

Jeremy8080 commented 10 months ago

Hi, I ran the DiffuseStyleGesture code and read the corresponding paper. The paper mentions that the text semantics of the model input can influence the generated gestures, for example when saying 'big'. But after modifying the input text, I found that the output gesture did not change at all. What is the reason for this?

YoungSeng commented 10 months ago

Thank you for your interest in this work. I think this is an expected and normal result:

  1. The generated motion is the result of multiple modalities: in addition to text, the model is conditioned on speech, seed gestures, speaker ID (style), and so on, so the text alone has limited influence.
  2. Whether the gestures that accompany the word "big" in the original dataset are distinctive also determines whether the model can learn this association at all.
  3. If both conditions above are met and the effect still cannot be reproduced, perhaps try the text "big big big big big ..." (a small probe for this is sketched below).

But in fact, saying "big" while making a very "big" gesture is only one possibility; human gestures are highly diverse. Just as people who say "one" do not always hold up a finger, this is more of a theoretical analysis. I hope this helps you.
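
A quick way to test point 3 empirically is to hold every other condition fixed (audio, seed gesture, speaker ID, and the diffusion sampling seed) and vary only the transcript, then measure how much the output actually changes. The sketch below is a minimal probe, assuming a hypothetical `generate_gesture(...)` wrapper around the repo's sampling script (the actual entry point, arguments, and output format in DiffuseStyleGesture+ will differ); the metric, mean absolute per-frame joint difference, is just one simple choice.

```python
import numpy as np

# Hypothetical wrapper around the repo's sampling entry point; adapt this
# to the actual sample script and arguments in DiffuseStyleGesture+.
from sampling_wrapper import generate_gesture  # hypothetical module


def text_sensitivity(audio, seed_gesture, speaker_id, text_a, text_b, seed=1234):
    """Generate gestures for two transcripts under otherwise identical
    conditions and return the mean absolute per-frame joint difference.

    Fixing the diffusion sampling seed matters: sampling is stochastic,
    so without it any measured difference could just be sampling noise.
    """
    g_a = generate_gesture(audio, text_a, seed_gesture, speaker_id, seed=seed)
    g_b = generate_gesture(audio, text_b, seed_gesture, speaker_id, seed=seed)
    # Both outputs are assumed to be (frames, joint_dims) rotation arrays.
    return float(np.mean(np.abs(np.asarray(g_a) - np.asarray(g_b))))


# Repeating the target word (point 3 above) amplifies any learned
# text-gesture association, making a non-zero difference easier to detect:
# diff = text_sensitivity(audio, seed_pose, spk_id, "big", "big big big big big")
```

If the difference stays near zero even for heavily repeated words, that would suggest the text branch contributes little for this input, consistent with points 1 and 2 above.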