NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Training on multiple languages for multiple speakers #84

Open trueProgrammer opened 3 years ago

trueProgrammer commented 3 years ago

First of all, thank you for releasing the code and for the fantastic paper!

I have read the instructions and all the issues, and after that I started training the model. I used the LJSpeech (LJS) dataset plus a single male Russian speaker with about 40 hours of good, clean speech. My steps were as follows (a sketch of the corresponding config changes follows the list):

  1. Added a Russian CMU dictionary and updated the symbols, acronyms, and cleaners; set n_text to 279 and n_speakers to 2.
  2. Trained from scratch with n_flows=1 for about 500k steps (training plot screenshots attached).
  3. Warm-started with n_flows=2 and include_layers set to None for the next 300k steps (training plot screenshots attached).
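
For reference, here is a minimal sketch of the config edits these steps imply, assuming the stock config.json layout. The key names (model_config.n_text, model_config.n_speakers, model_config.n_flows, train_config.checkpoint_path, train_config.include_layers) are taken from the default config and may differ in your fork:

```python
# Sketch of the config changes described in the steps above.
# Key names are assumptions based on the repo's default config.json; verify locally.
import json

with open("config.json") as f:
    config = json.load(f)

# Step 1: extended symbol set (English + Russian) and two speakers
config["model_config"]["n_text"] = 279
config["model_config"]["n_speakers"] = 2

# Step 2: train from scratch with a single flow
config["model_config"]["n_flows"] = 1

# Step 3 (warm start): switch to two flows and load all layers from the 1-flow model
# config["model_config"]["n_flows"] = 2
# config["train_config"]["checkpoint_path"] = "outdir/model_500000"  # hypothetical path
# config["train_config"]["include_layers"] = None

with open("config_multilang.json", "w") as f:
    json.dump(config, f, indent=4)
```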

Even though the plots look promising, the resulting audio is awful: it sounds like another language and you can't understand a single word. I have a few questions:

  1. Is it possible to train Flowtron on two languages simultaneously?
  2. If so, do I need to use the Russian CMU dictionary, or is it better to go without ARPAbet?
  3. Any thoughts on why I'm getting completely unintelligible speech, even when I use sentences from the training set and vary the frames, sigma, and gate parameters? (See the sketch below for how these enter inference.)

Thank you in advance!
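
For context on question 3, a minimal sketch of how those parameters enter Flowtron-style inference: the model samples a Gaussian latent whose scale is sigma and whose length is the number of frames, then inverts the flow conditioned on text and speaker. The model.infer call, speaker_vecs, and text names below are assumptions modeled on the repo's inference.py; verify against the actual script:

```python
import torch

sigma = 0.5          # scale of the Gaussian prior; lower values mean less variation
n_frames = 400       # number of mel-spectrogram frames to generate
n_mel_channels = 80  # must match data_config

# Sample the latent z ~ N(0, sigma^2 I) with shape [batch, n_mel_channels, n_frames];
# the flow is inverted on this tensor, conditioned on text and speaker embeddings.
residual = torch.randn(1, n_mel_channels, n_frames) * sigma
print(residual.shape, residual.std().item())

# The synthesis call is roughly as follows (names assumed, check inference.py);
# the gate threshold controls when the model is allowed to stop:
# with torch.no_grad():
#     mels, attentions = model.infer(residual, speaker_vecs, text)
```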

rafaelvalle commented 3 years ago

It should be fine to train on multiple languages simultaneously, and it's better to use the dictionary of your target language. Make sure the model has the correct set of token embeddings and that the data loader is using the right token IDs for each language.

The issue you're observing might be related to incorrect inputs.
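
One quick way to act on that suggestion is to dump the token IDs the data pipeline produces for a sentence in each language and confirm they land inside the embedding table. A minimal sketch, assuming the repo's Tacotron 2-style text module exposes text_to_sequence(text, cleaner_names); the cleaner names below are placeholders for whatever your data_config actually uses:

```python
from text import text_to_sequence  # Tacotron 2-style text module; adjust to your fork

n_text = 279  # size of the token embedding table (model_config.n_text)

# (sentence, cleaner names) per language; cleaner choices here are hypothetical.
samples = {
    "english": ("Hello world.", ["english_cleaners"]),
    "russian": ("Привет, мир.", ["basic_cleaners"]),
}

for lang, (sentence, cleaners) in samples.items():
    ids = text_to_sequence(sentence, cleaners)
    assert all(0 <= i < n_text for i in ids), f"{lang}: token id outside embedding table"
    print(lang, ids[:20])
```

If the two languages silently map to overlapping or out-of-range IDs, the model sees garbage inputs, which would explain unintelligible output despite healthy-looking training curves.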