YoungSeng / DiffuseStyleGesture

DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models (IJCAI 2023) | The DiffuseStyleGesture+ entry to the GENEA Challenge 2023 (ICMI 2023, Reproducibility Award)
MIT License

The result is not very good when the input is Chinese audio #11

Closed jiaqiAA closed 1 year ago

jiaqiAA commented 1 year ago

Hi,

When I input Chinese audio, the beat does not correspond to the body movements very well. Do you know why? Do I need to train the model with Chinese datasets?

YoungSeng commented 1 year ago
  1. My suggestion is to train with Chinese data, because ZEGGS is English-only;
  2. the audio features are WavLM representations, which may not transfer well to Chinese;
  3. ZEGGS is a female-voice dataset, so using a male voice (whether Chinese or English) may also degrade the results.

Good luck.

jiaqiAA commented 1 year ago

Thank you for your reply. I just tried a male voice, and the results are not good either.

jiaqiAA commented 1 year ago

Hi, I want to train the model on my own data, which has 23 bones. ZEGGS has 75, and MDM uses njoints=1141. I'd like to know how njoints is calculated.

When I use my own data, njoints=361 works.

Thanks!

YoungSeng commented 1 year ago

It depends on which gesture features you used; see https://github.com/YoungSeng/DiffuseStyleGesture/blob/d796b3910d5e6bae9918b0b564d94f6110ffff5b/main/process/process_zeggs_bvh.py#L214

https://github.com/YoungSeng/DiffuseStyleGesture/blob/d796b3910d5e6bae9918b0b564d94f6110ffff5b/BEAT-TWH-main/process/process_BEAT_bvh.py#L85

https://github.com/YoungSeng/DiffuseStyleGesture/blob/d796b3910d5e6bae9918b0b564d94f6110ffff5b/BEAT-TWH-main/process/process_TWH_bvh.py#L63

If you use your own motion data, njoints is simply the dimension of your motion features.
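In other words, njoints is the length of the per-frame feature vector produced by the BVH processing script, not the number of skeleton joints. A minimal sketch, assuming a hypothetical feature layout for a 23-joint skeleton (the real composition is defined in the `process_*_bvh.py` scripts linked above):

```python
import numpy as np

# Hypothetical feature blocks for a 23-joint skeleton over 100 frames;
# the actual blocks (positions, 6D rotations, velocities, root motion, ...)
# are determined by the processing script you use.
blocks = {
    "positions":    np.zeros((100, 23 * 3)),  # xyz per joint
    "rotations_6d": np.zeros((100, 23 * 6)),  # 6D rotation per joint
    "velocities":   np.zeros((100, 23 * 3)),  # positional velocity per joint
}

# Concatenate along the feature axis; njoints is the per-frame feature dim.
motion_features = np.concatenate(list(blocks.values()), axis=1)
njoints = motion_features.shape[1]
print(njoints)  # 276 for this hypothetical layout
```

With a different choice of blocks the total changes accordingly, which is why ZEGGS's 75-joint skeleton yields 1141 while a 23-joint skeleton can yield 361.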

jiaqiAA commented 1 year ago

Thanks for your reply. I have a question: the number of frames per inference step is stride_poses = n_poses - n_seed. For example, if stride_poses = 80 and the input data is 200 frames in total, 40 frames of data will be lost. Could I zero-pad the data at the end to 240 frames and delete the padding after inference? Would that affect the earlier results, or do you have another way to solve this?

Thanks!

https://github.com/YoungSeng/DiffuseStyleGesture/blob/d796b3910d5e6bae9918b0b564d94f6110ffff5b/main/mydiffusion_zeggs/sample.py#L225

YoungSeng commented 1 year ago

That's strange; there shouldn't be a problem with this. The GENEA Challenge submissions certainly have audio and gestures of the same length; see the code for DiffuseStyleGesture+.

As you said, assume each segment is 100 frames long (40 seed frames + 60 generated) and the speech is 200 frames. Inference starts by either zero-padding 40 frames or picking 40 frames of gesture from the dataset as the initial gesture, giving 240 frames. Each inference step then generates 100 frames, using the last 40 frames of the previous segment as input, i.e. it predicts 60 new frames. Finally, delete the first 40 zero-padded (or dataset-picked) frames, and the result is as long as the audio. The last segment (e.g. fewer than 60 remaining frames) can be handled in any way (discarded, zero-padded, etc.) with no effect on the result.
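The scheme above can be sketched as follows. This is a simplified skeleton, not the repo's sample.py: `generate` is a stand-in for the diffusion sampler, and n_seed=40, n_poses=100 match the example numbers.

```python
import numpy as np

N_SEED, N_POSES = 40, 100      # seed frames + window size
STRIDE = N_POSES - N_SEED      # 60 new frames per inference step

def generate(seed, n_new):
    """Stand-in for the diffusion sampler: returns n_new new frames
    conditioned on the seed frames (and, in the real model, the audio)."""
    return np.zeros((n_new, seed.shape[1]))

def sample_long(total_frames, feat_dim):
    # Start with N_SEED zero frames (or frames picked from the dataset)
    # as the initial gesture.
    out = [np.zeros((N_SEED, feat_dim))]
    generated = 0
    while generated < total_frames:
        seed = np.concatenate(out, axis=0)[-N_SEED:]  # last 40 frames
        out.append(generate(seed, STRIDE))            # predict 60 new frames
        generated += STRIDE
    motion = np.concatenate(out, axis=0)
    # Drop the artificial seed and trim the overshoot of the last segment
    # so the motion is exactly as long as the audio.
    return motion[N_SEED:N_SEED + total_frames]

print(sample_long(200, 361).shape)  # (200, 361)
```

Because the padded seed is deleted and the final partial segment is trimmed, the output length always matches the audio regardless of whether total_frames divides evenly by the stride.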