You need to provide the model with enough frames to generate the sentence. You can either fit a linear model between the length of the normalized text and the minimum number of frames required to generate that sentence, use it to predict the minimum, and add an offset to guarantee that you have more than the expected minimum.

Another alternative is to give a rather large number, e.g. `n_frames=2000`, and rely on the model to remove the excess frames.
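A minimal sketch of the linear-model option, assuming you have collected (normalized text length, mel frames) pairs from your training data; all numbers and variable names below are illustrative:

```python
# Sketch: fit a linear model mapping normalized-text length to mel frames,
# then pad the prediction with a safety offset. The (length, frames) pairs
# would come from your own training set; the values here are made up.
import numpy as np

text_lengths = np.array([12, 25, 40, 58, 73, 90])   # normalized text lengths
mel_frames = np.array([110, 230, 370, 540, 690, 850])  # frames per utterance

# Least-squares fit: frames ~ slope * length + intercept
slope, intercept = np.polyfit(text_lengths, mel_frames, deg=1)

def estimate_n_frames(text_length, offset=100):
    """Predict the minimum frames for a sentence and add a safety margin."""
    return int(slope * text_length + intercept) + offset

print(estimate_n_frames(50))  # frames to request for a length-50 sentence
```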
What training setup did you use? Did you train with the gate prior and then turn it off? If you did turn off the gate prior, did you remember to turn it off during inference as well?
Thank you @rafaelvalle. Apologies for the late reply, but re-training took some time.

The issue is now solved by two steps: increasing `n_text` to 360, and setting `attn_prior` to False and training for some iterations after having trained with `attn_prior` set to True. A sketch of the config changes is below.
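For anyone hitting the same issue, this is roughly what the change looks like, assuming a Flowtron-style `config.json`; the exact key names and sections (e.g. whether the flag is `attn_prior` or `use_attn_prior`, and which section it lives in) depend on the repo version, so verify against your copy:

```python
# Sketch: derive a fine-tuning config from the original config.json.
# Key names/sections are assumptions based on Flowtron-style configs;
# check them against your version of the repo before use.
import json

with open("config.json") as f:
    config = json.load(f)

# Step 1: increase n_text.
config["model_config"]["n_text"] = 360

# Step 2: disable the attention prior for the fine-tuning stage
# (the flag may be named "use_attn_prior" in some repo versions).
config["data_config"]["use_attn_prior"] = False

with open("config_finetune.json", "w") as f:
    json.dump(config, f, indent=4)
```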
Thank you for your response.
Thank you for the great paper and work.
After training an Arabic Flowtron, it generates correctly pronounced, good-quality audio with emotion and liveliness.

However, the quality and correctness of the generated audio is highly dependent on the number of frames (`n_frames`) at inference: the output can go from a noisy file that barely pronounces a single word to perfect audio with a difference of only 10-20 in the `n_frames` provided. Also, if the provided sentence is longer than 5 words, the trained model does not generate satisfactory audio; with the right adjustment of `n_frames`, it pronounces most of the words correctly but skips some. Any idea what the problem could be?
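For reference, a sketch of the kind of `n_frames` sweep I mean, following the general pattern of the repo's inference.py; `model` and `trainset` are assumed to be loaded as in that script, and the `model.infer()` argument order may differ between versions:

```python
# Sketch: sweep n_frames around an estimate to see how sharply quality flips.
# `model` and `trainset` are assumed to be set up as in Flowtron's
# inference.py; the model.infer() argument order may vary by repo version.
import torch

sigma = 0.5
text = trainset.get_text("normalized sentence here").cuda()[None]
speaker_vecs = trainset.get_speaker_id(0).cuda()[None]

for n_frames in range(380, 440, 10):  # a 10-20 frame change can flip quality
    residual = torch.cuda.FloatTensor(1, 80, n_frames).normal_() * sigma
    with torch.no_grad():
        mels, attentions = model.infer(residual, speaker_vecs, text)
    # vocode `mels` and compare the output per n_frames value
```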
Thank you