You need to provide the model with enough frames to generate the sentence. You can either fit a linear model between the length of the normalized text and the minimum number of frames required to generate that sentence, use it to predict the minimum, and add an offset to guarantee that you have more than the expected minimum.

Another alternative is to give a rather large number, e.g. `n_frames=2000`, and rely on the model to remove the excess frames.
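A minimal sketch of the linear-model option, assuming you have collected (normalized text length, mel frames) pairs from your training data; all numbers and variable names below are illustrative:

```python
# Sketch: fit a linear model mapping normalized-text length to mel frames,
# then pad the prediction with a safety offset. The (length, frames) pairs
# would come from your own training set; the values here are made up.
import numpy as np

text_lengths = np.array([12, 25, 40, 58, 73, 90])   # normalized text lengths
mel_frames = np.array([110, 230, 370, 540, 690, 850])  # frames per utterance

# Least-squares fit: frames ~ slope * length + intercept
slope, intercept = np.polyfit(text_lengths, mel_frames, deg=1)

def estimate_n_frames(text_length, offset=100):
    """Predict the minimum frames for a sentence and add a safety margin."""
    return int(slope * text_length + intercept) + offset

print(estimate_n_frames(50))  # frames to request for a length-50 sentence
```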
What training setup did you use? Did you train with the gate prior and then turn it off? If you did turn off the gate prior, did you remember to turn it off during inference as well?
Thank you @rafaelvalle. Apologies for the late reply, but re-training took some time.

The issue is now solved by two steps: increasing `n_text` to 360, and setting `attn_prior` to False and training for some iterations after having trained with `attn_prior` set to True. A sketch of the config changes is below.
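For anyone hitting the same issue, this is roughly what the change looks like, assuming a Flowtron-style `config.json`; the exact key names and sections (e.g. whether the flag is `attn_prior` or `use_attn_prior`, and which section it lives in) depend on the repo version, so verify against your copy:

```python
# Sketch: derive a fine-tuning config from the original config.json.
# Key names/sections are assumptions based on Flowtron-style configs;
# check them against your version of the repo before use.
import json

with open("config.json") as f:
    config = json.load(f)

# Step 1: increase n_text.
config["model_config"]["n_text"] = 360

# Step 2: disable the attention prior for the fine-tuning stage
# (the flag may be named "use_attn_prior" in some repo versions).
config["data_config"]["use_attn_prior"] = False

with open("config_finetune.json", "w") as f:
    json.dump(config, f, indent=4)
```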
Thank you for your response.
Thank you for the great paper and work.
After training an Arabic Flowtron, it generates correctly pronounced, good-quality audio with emotion and liveliness.

However, the quality and correctness of the generated audio is highly dependent on the number of frames (`n_frames`) at inference: the output can go from a noisy file that barely pronounces a single word to perfect audio with a difference of only 10-20 in the `n_frames` provided. Also, if the provided sentence is longer than 5 words, the trained model does not generate satisfactory audio; with the right adjustment of `n_frames`, it pronounces most of the words correctly but skips some. Any idea what the problem could be?
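For reference, a sketch of the kind of `n_frames` sweep I mean, following the general pattern of the repo's inference.py; `model` and `trainset` are assumed to be loaded as in that script, and the `model.infer()` argument order may differ between versions:

```python
# Sketch: sweep n_frames around an estimate to see how sharply quality flips.
# `model` and `trainset` are assumed to be set up as in Flowtron's
# inference.py; the model.infer() argument order may vary by repo version.
import torch

sigma = 0.5
text = trainset.get_text("normalized sentence here").cuda()[None]
speaker_vecs = trainset.get_speaker_id(0).cuda()[None]

for n_frames in range(380, 440, 10):  # a 10-20 frame change can flip quality
    residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
    with torch.no_grad():
        mels, attentions = model.infer(residual, speaker_vecs, text)
    # vocode `mels` and compare the output per n_frames value
```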
Thank you