Hi, we did the subdivision following prior work to keep our comparisons fair. As a result, if the clip length is not an exact multiple of the subdivision length, the final portion is dropped from the generation pipeline. You can change this behavior by changing the value of `unit_time`. I have added `unit_time` as a parameter of the method `render_clip` inside `processor_v2.py` to make this easier.
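For illustration, the dropped-tail behavior described above is roughly equivalent to the sketch below (this is not the actual code in `processor_v2.py`; the helper name, the loop, and the 16 kHz sample rate are assumptions taken from the discussion):

```python
SAMPLE_RATE = 16000  # assumed, based on the len(clip_audio) / 16000 question below

def split_into_units(clip_audio, unit_time):
    """Split a clip into fixed-length subdivisions of `unit_time` seconds.

    Illustrative only: with floor division, any audio shorter than a full
    unit at the end of the clip is silently dropped, which is why the
    generated pose sequence ends up shorter than the input audio.
    """
    unit_samples = int(unit_time * SAMPLE_RATE)
    n_units = len(clip_audio) // unit_samples  # floor division drops the tail
    return [
        clip_audio[i * unit_samples:(i + 1) * unit_samples]
        for i in range(n_units)
    ]
```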
How should I set `unit_time`? Should it be the same as `len(clip_audio) / 16000`? If I change `unit_time`, the audio and text feature lengths no longer match.
Ah, never mind, it was a much simpler fix: I only needed to add the remaining portion of the clip as an extra subdivision. I've uploaded the revised code.
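Roughly, that fix amounts to something like the following sketch (a guess at the idea, not the uploaded code; the helper name and sample rate are assumptions):

```python
SAMPLE_RATE = 16000  # assumed sample rate, as above

def split_into_units_keep_tail(clip_audio, unit_time):
    """Same fixed-length subdivision, but keep the leftover samples
    as one extra, shorter subdivision instead of dropping them."""
    unit_samples = int(unit_time * SAMPLE_RATE)
    units = [
        clip_audio[i * unit_samples:(i + 1) * unit_samples]
        for i in range(len(clip_audio) // unit_samples)
    ]
    remainder = len(clip_audio) % unit_samples
    if remainder:
        units.append(clip_audio[-remainder:])  # extra subdivision covering the tail
    return units
```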
I managed to fix them. Thank you.
Hi,
I am wondering why the word and audio features have to be divided into several sequences during inference. This somehow results in an output pose sequence that is shorter than the input audio. Is there a way to fix that?