Hi, we did the subdivision following prior work to keep our comparisons fair. As a result, if the clip length is not an exact multiple of the subdivision length, the final portion is dropped from the generation pipeline. You can change this behavior by changing the value of `unit_time`. I have added `unit_time` as a parameter of the method `render_clip` inside `processor_v2.py` to make this easier.
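For illustration, the dropped-tail behavior described above is roughly equivalent to the sketch below (this is not the actual code in `processor_v2.py`; the helper name, the loop, and the 16 kHz sample rate are assumptions taken from the discussion):

```python
SAMPLE_RATE = 16000  # assumed, based on the len(clip_audio) / 16000 question below

def split_into_units(clip_audio, unit_time):
    """Split a clip into fixed-length subdivisions of `unit_time` seconds.

    Illustrative only: with floor division, any audio shorter than a full
    unit at the end of the clip is silently dropped, which is why the
    generated pose sequence ends up shorter than the input audio.
    """
    unit_samples = int(unit_time * SAMPLE_RATE)
    n_units = len(clip_audio) // unit_samples  # floor division drops the tail
    return [
        clip_audio[i * unit_samples:(i + 1) * unit_samples]
        for i in range(n_units)
    ]
```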
How should I set `unit_time`? Should it be the same as `len(clip_audio) / 16000`? If I change `unit_time`, the audio and text feature lengths no longer match.
Ah, never mind, it was a much simpler fix: I only needed to add the remaining portion of the clip as an extra subdivision. I've uploaded the revised code.
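Roughly, that fix amounts to something like the following sketch (a guess at the idea, not the uploaded code; the helper name and sample rate are assumptions):

```python
SAMPLE_RATE = 16000  # assumed sample rate, as above

def split_into_units_keep_tail(clip_audio, unit_time):
    """Same fixed-length subdivision, but keep the leftover samples
    as one extra, shorter subdivision instead of dropping them."""
    unit_samples = int(unit_time * SAMPLE_RATE)
    units = [
        clip_audio[i * unit_samples:(i + 1) * unit_samples]
        for i in range(len(clip_audio) // unit_samples)
    ]
    remainder = len(clip_audio) % unit_samples
    if remainder:
        units.append(clip_audio[-remainder:])  # extra subdivision covering the tail
    return units
```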
I managed to fix them. Thank you.
Hi,
I am wondering why the word and audio features have to be divided into several sequences during inference. This somehow results in an output pose sequence that is shorter than the input audio. Is there a way to fix that?