NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Output length is fixed? #120

Closed andi-808 closed 3 years ago

andi-808 commented 3 years ago

Hello,

I am in the process of training from a pre-trained model (no success so far). Running inference on some of the models I've produced does yield audio, but no matter how long the sentence I want the model to speak, the generated clip is always 5 seconds long (410 KB).

Is there something I’m missing? Is it because my models are currently garbage and won’t produce the correct output until good attention is achieved? The voice tone/timbre sounds correct, albeit gibberish.

andi-808 commented 3 years ago

OK, so the file size does change now that I've properly trained with my data. It took a while, but I'm getting speech out with my dataset. However, there is still a maximum limit on the length of the utterance.

Do I warm-start with "n_text" set to a larger value? I tried this but got an error before training even started.

andi-808 commented 3 years ago

I managed to get the length to vary during inference.

Bahm9919 commented 2 years ago

> I managed to get the length to vary during inference.

Can you share, how could you do that?

andi-808 commented 2 years ago

> > I managed to get the length to vary during inference.
>
> Can you share, how could you do that?

Hey, the output length stayed fixed for as long as the model was still making progress in training. At the point where it looked like it was over-fitting, I stopped and reduced the learning rate. Once I managed to find the minimum, actual legible vocabulary was produced. As training got closer to that minimum, the output length started to vary, matching the text more and more closely the longer it learned.
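The resume-and-lower-the-learning-rate step described above can be sketched roughly as follows in PyTorch. This is a minimal illustration, not Flowtron's actual training script: the `Linear` stand-in model, the Adam optimizer, the checkpoint keys, and the 10x reduction factor are all assumptions for the sake of the example.

```python
import io
import torch

# Stand-in for the Flowtron model; the real script builds the full network.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# ... train until validation loss starts rising (over-fitting), then save.
# An in-memory buffer stands in for a checkpoint file on disk.
buf = io.BytesIO()
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict()}, buf)

# Resume from the checkpoint and lower the learning rate before continuing.
buf.seek(0)
state = torch.load(buf)
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
for group in optimizer.param_groups:
    group["lr"] *= 0.1  # e.g. 1e-4 -> 1e-5; factor is a common heuristic

# ... continue training at the reduced rate until loss bottoms out.
```

The same effect can also be had by editing the learning rate in the config before relaunching training from the checkpoint, if the training script restores the optimizer state.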

Bahm9919 commented 2 years ago

> > > I managed to get the length to vary during inference.
> >
> > Can you share, how could you do that?
>
> Hey, the output length stayed fixed for as long as the model was still making progress in training. At the point where it looked like it was over-fitting, I stopped and reduced the learning rate. Once I managed to find the minimum, actual legible vocabulary was produced. As training got closer to that minimum, the output length started to vary, matching the text more and more closely the longer it learned.

Thanks for your reply and explanation. One clarification: do you mean that continuing training with a reduced learning rate until you find a local minimum will increase the length of the pronounceable output?