NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Training issue with Male voice #39

Open akshay4malik opened 4 years ago

akshay4malik commented 4 years ago

I am trying to train a Flowtron model on a male voice. After training for about 270,000 steps, the generated audio is very random: not a single word is produced properly, and the model is not even learning attention. Earlier I tried the LJ Speech dataset; after about 170,000 steps of training the audio samples were not so bad. The pronunciation was not up to the mark, but I could understand what was being said. I am attaching the attention plots here. I have the same amount of data as LJ Speech. (attachments: sid0_sigma0.5_attnlayer1, sid0_sigma0.5_attnlayer0)

rafaelvalle commented 4 years ago

Training on what language? Did you try warm-starting from the pre-trained model? Trimming silences from the beginning and end of audio files helps with learning attention.
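The silence trimming suggested above can be sketched with a simple amplitude threshold. This is a toy stand-in operating on a plain list of samples; real pipelines typically use something like librosa.effects.trim, which works in decibels on actual audio arrays:

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose absolute amplitude is
    below threshold. Toy sketch: real trimming tools use a dB threshold
    and frame-level energy rather than per-sample amplitude."""
    voiced = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not voiced:
        return []
    return samples[voiced[0]:voiced[-1] + 1]

audio = [0.0, 0.001, 0.5, -0.4, 0.3, 0.002, 0.0]
trimmed = trim_silence(audio)  # -> [0.5, -0.4, 0.3]
```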

akshay4malik commented 4 years ago

I am doing it for Hindi Language. We have trimmed silences from the beginning and end of the audio files.

rafaelvalle commented 4 years ago

Were you able to train the same data on Tacotron before?

akshay4malik commented 4 years ago

Yes, We were getting good results on Tacotron 2 with the same data. But flowtron offers several new features, so we thought of training on it as well.

rafaelvalle commented 4 years ago

That's great news. Use the pre-trained weights from your Tacotron model to warm-start a Flowtron with a single step of flow. Once the first step of flow has learned to attend, add the second step of flow and train the full model.

akshay4malik commented 4 years ago

Ok, I will try that and post the results. But what could be the possible reason for the failure when I try to train the Flowtron model directly?

rafaelvalle commented 4 years ago

I think you're trying to learn both steps of flow at the same time. As we describe in our paper, it's easier to train Flowtron and its steps of flow progressively, for example:

1. First train Flowtron with one step of flow until it learns to attend to the text.
2. Use this model to warm-start a Flowtron with 2 steps of flow and train the entire model.
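In config terms, stage 1 might look like the fragment below. The field names follow the config.json excerpt shared later in this thread and the repo's config layout as I understand it; the path is a placeholder:

```json
{
  "train_config": {
    "checkpoint_path": "",
    "warmstart_checkpoint_path": "path/to/tacotron2_or_ljs_checkpoint"
  },
  "model_config": {
    "n_flows": 1
  }
}
```

For stage 2, point warmstart_checkpoint_path at the stage-1 Flowtron checkpoint and set "n_flows": 2, leaving "include_layers": null so all stage-1 weights are carried over.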

akshay4malik commented 4 years ago

While training from the pretrained Tacotron 2 model, there are some issues. I have worked around some of them, but certain properties of the Tacotron 2 model are causing problems. Here is the error, which I believe is due to the prenet and postnet layers in Tacotron 2 that are not present in Flowtron:

RuntimeError: Error(s) in loading state_dict for Flowtron: Unexpected key(s) in state_dict: "decoder.prenet.layers.0.linear_layer.weight", "decoder.prenet.layers.1.linear_layer.weight", "decoder.attention_rnn.weight_ih", "decoder.attention_rnn.weight_hh", "decoder.attention_rnn.bias_ih", "decoder.attention_rnn.bias_hh", "decoder.attention_layer.query_layer.linear_layer.weight", "decoder.attention_layer.memory_layer.linear_layer.weight", "decoder.attention_layer.v.linear_layer.weight", "decoder.attention_layer.location_layer.location_conv.conv.weight", "decoder.attention_layer.location_layer.location_dense.linear_layer.weight", "decoder.decoder_rnn.weight_ih", "decoder.decoder_rnn.weight_hh", "decoder.decoder_rnn.bias_ih", "decoder.decoder_rnn.bias_hh", "decoder.linear_projection.linear_layer.weight", "decoder.linear_projection.linear_layer.bias", "decoder.gate_layer.linear_layer.weight", "decoder.gate_layer.linear_layer.bias", "postnet.convolutions.0.0.conv.weight", "postnet.convolutions.0.0.conv.bias", "postnet.convolutions.0.1.weight", "postnet.convolutions.0.1.bias", "postnet.convolutions.0.1.running_mean", "postnet.convolutions.0.1.running_var", "postnet.convolutions.0.1.num_batches_tracked", "postnet.convolutions.1.0.conv.weight", "postnet.convolutions.1.0.conv.bias", "postnet.convolutions.1.1.weight", "postnet.convolutions.1.1.bias", "postnet.convolutions.1.1.running_mean", "postnet.convolutions.1.1.running_var", "postnet.convolutions.1.1.num_batches_tracked", "postnet.convolutions.2.0.conv.weight", "postnet.convolutions.2.0.conv.bias", "postnet.convolutions.2.1.weight", "postnet.convolutions.2.1.bias", "postnet.convolutions.2.1.running_mean", 
"postnet.convolutions.2.1.running_var", "postnet.convolutions.2.1.num_batches_tracked", "postnet.convolutions.3.0.conv.weight", "postnet.convolutions.3.0.conv.bias", "postnet.convolutions.3.1.weight", "postnet.convolutions.3.1.bias", "postnet.convolutions.3.1.running_mean", "postnet.convolutions.3.1.running_var", "postnet.convolutions.3.1.num_batches_tracked", "postnet.convolutions.4.0.conv.weight", "postnet.convolutions.4.0.conv.bias", "postnet.convolutions.4.1.weight", "postnet.convolutions.4.1.bias", "postnet.convolutions.4.1.running_mean", "postnet.convolutions.4.1.running_var", "postnet.convolutions.4.1.num_batches_tracked", "encoder.convolutions.0.1.running_mean", "encoder.convolutions.0.1.running_var", "encoder.convolutions.0.1.num_batches_tracked", "encoder.convolutions.1.1.running_mean", "encoder.convolutions.1.1.running_var", "encoder.convolutions.1.1.num_batches_tracked", "encoder.convolutions.2.1.running_mean", "encoder.convolutions.2.1.running_var", "encoder.convolutions.2.1.num_batches_tracked".

rafaelvalle commented 4 years ago

These are harmless and expected given that Flowtron does not have these layers. Are you using warmstart_checkpoint_path instead of checkpoint_path?

akshay4malik commented 4 years ago

Yes, I am using warmstart_checkpoint_path, and I have changed "include_layers": ["speaker", "encoder", "embedding"] to "include_layers": ["encoder", "embedding"], and I am using n_flows = 1. Since you mentioned these errors are harmless, how can I get past them and start training?

rafaelvalle commented 4 years ago

If you're using warmstart_checkpoint_path, the loaded state_dict should be filtered and not have the weights you listed.

Can you send the full stack trace? If the issue is happening here, you might have to save the Tacotron 2 weights as a state_dict instead of loading ['model'].
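The filtered warm-start load discussed above can be sketched as follows. The function name and the prefix-matching rule are illustrative, not Flowtron's actual code, and the key names are abbreviated from the error messages in this thread:

```python
def filter_state_dict(state_dict, include_layers):
    """Keep only parameters whose top-level module name is listed in
    include_layers; with include_layers=None, keep everything.
    Illustrative sketch of warm-start filtering."""
    if include_layers is None:
        return dict(state_dict)
    return {k: v for k, v in state_dict.items()
            if k.split('.')[0] in include_layers}

# Tacotron 2 decoder/postnet weights are dropped; encoder/embedding kept.
tacotron_sd = {
    "encoder.convolutions.0.0.conv.weight": "w1",
    "embedding.weight": "w2",
    "decoder.prenet.layers.0.linear_layer.weight": "w3",
    "postnet.convolutions.0.0.conv.weight": "w4",
}
kept = filter_state_dict(tacotron_sd, ["encoder", "embedding"])
```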

akshay4malik commented 4 years ago

No, the problem in that function occurs here. But this can be solved by removing this if condition, since there are no speaker embeddings in the Tacotron 2 model. The problem then occurs when loading the optimizer here. However, I replaced this line, optimizer.load_state_dict(checkpoint_dict['optimizer']), with the following lines:

optimizer.state_dict()['param_groups'] = checkpoint_dict['optimizer']['param_groups']
optimizer.state_dict()['state'] = checkpoint_dict['optimizer']['state']

I am not sure about this solution, though. And after all this, the problem comes in this line. The error is the following:

RuntimeError: Error(s) in loading state_dict for Flowtron: Missing key(s) in state_dict: "speaker_embedding.weight", "flows.0.conv.weight", "flows.0.conv.bias", "flows.0.lstm.weight_ih_l0", "flows.0.lstm.weight_hh_l0", "flows.0.lstm.bias_ih_l0", "flows.0.lstm.bias_hh_l0", "flows.0.lstm.weight_ih_l1", "flows.0.lstm.weight_hh_l1", "flows.0.lstm.bias_ih_l1", "flows.0.lstm.bias_hh_l1", "flows.0.attention_lstm.weight_ih_l0", "flows.0.attention_lstm.weight_hh_l0", "flows.0.attention_lstm.bias_ih_l0", "flows.0.attention_lstm.bias_hh_l0", "flows.0.attention_layer.query.linear_layer.weight", "flows.0.attention_layer.key.linear_layer.weight", "flows.0.attention_layer.value.linear_layer.weight", "flows.0.attention_layer.v.linear_layer.weight", "flows.0.dense_layer.layers.0.linear_layer.weight", "flows.0.dense_layer.layers.0.linear_layer.bias", "flows.0.dense_layer.layers.1.linear_layer.weight", "flows.0.dense_layer.layers.1.linear_layer.bias", "flows.0.gate_layer.linear_layer.weight", "flows.0.gate_layer.linear_layer.bias". 
Unexpected key(s) in state_dict: "decoder.prenet.layers.0.linear_layer.weight", "decoder.prenet.layers.1.linear_layer.weight", "decoder.attention_rnn.weight_ih", "decoder.attention_rnn.weight_hh", "decoder.attention_rnn.bias_ih", "decoder.attention_rnn.bias_hh", "decoder.attention_layer.query_layer.linear_layer.weight", "decoder.attention_layer.memory_layer.linear_layer.weight", "decoder.attention_layer.v.linear_layer.weight", "decoder.attention_layer.location_layer.location_conv.conv.weight", "decoder.attention_layer.location_layer.location_dense.linear_layer.weight", "decoder.decoder_rnn.weight_ih", "decoder.decoder_rnn.weight_hh", "decoder.decoder_rnn.bias_ih", "decoder.decoder_rnn.bias_hh", "decoder.linear_projection.linear_layer.weight", "decoder.linear_projection.linear_layer.bias", "decoder.gate_layer.linear_layer.weight", "decoder.gate_layer.linear_layer.bias", "postnet.convolutions.0.0.conv.weight", "postnet.convolutions.0.0.conv.bias", "postnet.convolutions.0.1.weight", "postnet.convolutions.0.1.bias", "postnet.convolutions.0.1.running_mean", "postnet.convolutions.0.1.running_var", "postnet.convolutions.0.1.num_batches_tracked", "postnet.convolutions.1.0.conv.weight", "postnet.convolutions.1.0.conv.bias", "postnet.convolutions.1.1.weight", "postnet.convolutions.1.1.bias", "postnet.convolutions.1.1.running_mean", "postnet.convolutions.1.1.running_var", "postnet.convolutions.1.1.num_batches_tracked", "postnet.convolutions.2.0.conv.weight", "postnet.convolutions.2.0.conv.bias", "postnet.convolutions.2.1.weight", "postnet.convolutions.2.1.bias", "postnet.convolutions.2.1.running_mean", "postnet.convolutions.2.1.running_var", "postnet.convolutions.2.1.num_batches_tracked", "postnet.convolutions.3.0.conv.weight", "postnet.convolutions.3.0.conv.bias", "postnet.convolutions.3.1.weight", "postnet.convolutions.3.1.bias", "postnet.convolutions.3.1.running_mean", "postnet.convolutions.3.1.running_var", "postnet.convolutions.3.1.num_batches_tracked", 
"postnet.convolutions.4.0.conv.weight", "postnet.convolutions.4.0.conv.bias", "postnet.convolutions.4.1.weight", "postnet.convolutions.4.1.bias", "postnet.convolutions.4.1.running_mean", "postnet.convolutions.4.1.running_var", "postnet.convolutions.4.1.num_batches_tracked", "encoder.convolutions.0.1.running_mean", "encoder.convolutions.0.1.running_var", "encoder.convolutions.0.1.num_batches_tracked", "encoder.convolutions.1.1.running_mean", "encoder.convolutions.1.1.running_var", "encoder.convolutions.1.1.num_batches_tracked", "encoder.convolutions.2.1.running_mean", "encoder.convolutions.2.1.running_var", "encoder.convolutions.2.1.num_batches_tracked".

rafaelvalle commented 4 years ago

You should pass only warmstart_checkpoint_path, not checkpoint_path. If you pass checkpoint_path, the wrong method, load_checkpoint, will be executed. As you said, you'll need to comment out the speaker embedding check.

akshay4malik commented 4 years ago

I have not added checkpoint_path; below is the config json:

"train_config": {
    "output_directory": "outdir",
    "epochs": 10000000,
    "learning_rate": 1e-4,
    "weight_decay": 1e-6,
    "sigma": 1.0,
    "iters_per_checkpoint": 5000,
    "batch_size": 1,
    "seed": 1234,
    "checkpoint_path": "",
    "ignore_layers": [],
    "include_layers": ["encoder", "embedding"],
    "warmstart_checkpoint_path": "warmStartOnTacotron/gpu5_checkpoint_176000",
    "with_tensorboard": true,
    "fp16_run": false
}

rafaelvalle commented 4 years ago

You mentioned issues when loading the optimizer. The optimizer is only loaded if this condition is satisfied, which then executes load_checkpoint.

rafaelvalle commented 4 years ago

The same applies to the error you mentioned seeing here. This function is only executed if you pass checkpoint_path.
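The dispatch behavior described in these two comments can be summarized in a small sketch. The function and return values are illustrative, not Flowtron's actual code:

```python
def resolve_start_mode(checkpoint_path, warmstart_checkpoint_path):
    """Mirror the behavior discussed above: a non-empty checkpoint_path
    wins and resumes training (weights plus optimizer state, via
    load_checkpoint); warmstart_checkpoint_path loads filtered weights
    only, with a fresh optimizer; otherwise training starts from
    scratch."""
    if checkpoint_path:
        return "resume"
    if warmstart_checkpoint_path:
        return "warmstart"
    return "scratch"

mode = resolve_start_mode("", "models/flowtron_ljs.pt")  # -> "warmstart"
```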

akshay4malik commented 4 years ago

I got it. I am sorry, I made a little mistake while giving the "train" command. Just when you pointed it out, I looked at the functions. The training has started, but here is something I would like to know: should I start training with n_flow = 1 for around 100,000 iterations and then start again with a warm start with n_flow = 2?

rafaelvalle commented 4 years ago

Does it work when you comment out this and pass a model to warmstart_checkpoint_path?

akshay4malik commented 4 years ago

> Does it work when you comment out this and pass a model to warmstart_checkpoint_path?

The error was occurring because, while running the code, I was giving the checkpoint path. I did not realize these functions would not be called until you mentioned it.

rafaelvalle commented 4 years ago

Great! Let us know once you're able to train with the male hindi voice.

akshay4malik commented 4 years ago

> Great! Let us know once you're able to train with the male hindi voice.

Sure, I will definitely keep you informed. But here is something I would like to know: should I start training with n_flow = 1 for around 100,000 iterations and then start again with a warm start with n_flow = 2?

rafaelvalle commented 4 years ago

Yes! Train with n_flow=1 until attention starts looking good, then use the n_flow=1 model to warm-start a model with n_flow=2, including all weights from the n_flow=1 model. If include_layers=None, it will include all weights, as you can see here.
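The n_flow=1 to n_flow=2 warm start amounts to overwriting every matching key in the freshly initialized 2-flow state_dict, while the new flows.1.* parameters keep their random initialization. A toy sketch with abbreviated key names (not Flowtron's actual code):

```python
def warmstart_two_flows(stage1_sd, fresh_two_flow_sd):
    # Every key present in the 1-flow checkpoint overwrites its freshly
    # initialized counterpart; keys only present in the 2-flow model
    # (flows.1.*) keep their random initialization.
    merged = dict(fresh_two_flow_sd)
    merged.update(stage1_sd)
    return merged

stage1 = {"encoder.w": "trained", "flows.0.lstm.weight_ih_l0": "trained"}
fresh = {"encoder.w": "random",
         "flows.0.lstm.weight_ih_l0": "random",
         "flows.1.lstm.weight_ih_l0": "random"}
merged = warmstart_two_flows(stage1, fresh)
```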

akshay4malik commented 4 years ago

> Yes! Train with n_flow=1 until attention starts looking good then use the n_flow=1 to warmstart a model with n_flow=2, including all weights from n_flow=1. If include_layers=None it will include all weights, as you can see here.

Sure! Thanks a lot for all the help.

akshay4malik commented 4 years ago

@rafaelvalle I have one more query: how important is CMUDict when training a Flowtron model? It is not available for Hindi, so I have bypassed it. How will that affect the results? The Tacotron 2 model does not face any issue when we bypass CMUDict.

rafaelvalle commented 4 years ago

It should not be an issue, given that in Hindi there is a one-to-one correspondence between graphemes and phonemes.

akashicMarga commented 4 years ago

@rafaelvalle I have a similar setup for Hindi. I have trained for about 230k steps, but the attention is not aligning. I have a trained Tacotron model for Hindi which works very well, and I used its weights to warm-start Flowtron. @akshay4malik did you get good results?

akshay4malik commented 4 years ago

@singhaki Yes, I got attention, and the generated speech is fair as well. Instead of warm-starting on the Tacotron 2 model, try the LJ Speech pretrained model, which is publicly available. And you will have to wait a little longer than 230K steps: on flow-1, you will start getting attention around 0.5M steps.

rafaelvalle commented 4 years ago

@akshay4malik were you able to train the model with 2 steps of flow by warm-starting from the model with 1 step of flow you trained on your data?

akshay4malik commented 4 years ago

@rafaelvalle Yes, the training for step 2 is ongoing. In TensorBoard I am getting a good attention plot for the second step as well, but the generated audio is not good yet. I hope it will improve with further training.

akashicMarga commented 4 years ago

(attachments: losses_500k, attention_500k)

My loss curves and attention look like this after training for 500k steps. Should I decrease the learning rate? Any suggestions, @akshay4malik @rafaelvalle? The audio generated by the model is gibberish.

rafaelvalle commented 4 years ago

@akshay4malik it should improve with time. Share a sample, the training and validation losses, and the attention maps with us if you can.

rafaelvalle commented 4 years ago

@singhaki your validation loss is going up and your model is overfitting. how do the attention maps look before the validation loss starts going up?

akashicMarga commented 4 years ago

@rafaelvalle yes, it starts overfitting after 50k steps. The attention stays nearly the same throughout training. I tried to synthesize audio from the 50k checkpoint, but it was gibberish.

astricks commented 4 years ago

Hi @rafaelvalle @singhaki @akshay4malik. First off, stellar research as usual, @rafaelvalle, and thanks for sharing all the quality code.

I just started training my Hindi speaker model and fortunately stumbled upon this very helpful thread.

I'm using a 25-hour single-speaker male dataset. It's clear of silences, 22050 Hz, 16-bit PCM; it works well with WaveGlow, and I have not tried training Tacotron.

Couple of questions

  1. For Hindi speech transcriptions, I'm wondering whether it is better to use Devanagari letters or unidecode transliterations using the English character set. I'm currently considering the character set below. _letters = 'अआइईउऊऋएऐऑओऔकखगघचछजझञटठडढणतथदधनपफबभमयरलवशषसहह़ा'

  2. If you used the Devanagari script (and not English transliterations): a. Could you please share the symbols.py you used? b. Would warm-starting from flowtron_ljs.pt be futile, since the text would be completely different?

akshay4malik commented 4 years ago

@astricks The character set you have chosen is good enough; you can use it, as I am doing the same. And instead of starting from scratch, warm-start on flowtron_ljs.pt. Rather than starting your training from random weights, it's better to start from somewhere :) .
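For reference, a Devanagari symbol set could be laid out like the sketch below, modeled on the structure of the repo's English symbols module. The exact character inventory here is an assumption (it extends the set quoted above with the remaining consonants and the dependent vowel signs), not the file anyone in this thread actually used:

```python
# Hypothetical symbols.py for Devanagari; the inventory is an assumption.
_pad = '_'
_punctuation = '!\'(),.:;?। '  # includes the Devanagari danda and space
_vowels = 'अआइईउऊऋएऐऑओऔ'
_consonants = 'कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह'
# Dependent vowel signs (matras) and combining marks must be listed too,
# since they appear as separate code points in the transcriptions.
_signs = 'ािीुूृेैोौंःँ़्'

# The embedding table is indexed by position in this list, so its order
# must stay fixed between training and inference.
symbols = [_pad] + list(_punctuation) + list(_vowels + _consonants + _signs)
```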

astricks commented 4 years ago

@akshay4malik Thanks for the clarification! I'll start training and post back results!

rafaelvalle commented 4 years ago

@singhaki please confirm you're training just 1 step of flow and that your model is able to learn attention. the plots you shared show otherwise.

akashicMarga commented 4 years ago

@rafaelvalle I was warm-starting from the LJS pretrained model with n_flow = 1, and the model was not able to learn attention; it starts overfitting at 50k steps. Is it due to the data I have? Is 4.7 hours enough? My Tacotron model works pretty well on the same data.

akshay4malik commented 4 years ago

@rafaelvalle My model trained properly on flow-1, generating decent audio. But it's not attending on flow-2. The generated audio shows very strange behavior: it remains good at the beginning and at the end, but in the middle it plays as if an audio tape got stuck. (attachments: sid0_sigma0.5_attnlayer1, sid0_sigma0.5_attnlayer0)

While warm-starting training for the second flow, I set include_layers = null. Do you see any obvious reason for this kind of behavior?

astricks commented 4 years ago

@akshay4malik Good to know your model attends on a Hindi dataset! Could you please share how many iterations it took for your model to learn attention? Did you warm-start off a Tacotron model? How did you anneal the learning rate?

I've been trying for a while to get attention on flow-1, without quite succeeding. The attention graph looks very faint even after 500k iterations, unlike your attention plot, which seems very crisp and clear. I tried warm-starting off both LJS and Tacotron.

I just cleaned my dataset again (I discovered the data had some silences at the end) and am trying to warm-start off my Tacotron model again.

akshay4malik commented 4 years ago

@astricks I warm-started on the LJS model with default settings. It took around 500K iterations to attain attention.

raikarsagar commented 4 years ago

Hi, I am training flow-1 on Hindi data (about 10 hrs) by warm-starting from a Tacotron 2 model with the text embeddings. Even after 280k steps there is no sign of attention being learnt. I found this thread very relevant to this case. I have set the learning rate to 9e-5 after looking at some git issues, and there is not much silence at the beginning of the audio files.

A few questions: @singhaki were you able to get attention? If so, after approximately how many steps? @akshay4malik can you comment on the attention plots shown below? It would be great if you could share the intermediate attention plots of your model. @astricks in #53 you reported that warm-starting from LJS didn't work. Could you please share the latest model config details which you have got working?

(attachment: validation loss plot)

(attachment: attention plots, flowtron_attn)

Thanks in advance, Sagar

astricks commented 4 years ago

@raikarsagar For me, what worked was cleaning my data once again to remove even the smallest leading/trailing silences, and also rechecking all my text transcriptions. Good data will lead to good training and attention; bad data will not. I'll also add that I have never used less than 20 hours of data for training, so I'm curious to see whether 10 hours is enough to gain attention.

Syed044 commented 3 years ago

> Hi @rafaelvalle @singhaki @akshay4malik. First off, stellar research as usual, @rafaelvalle, and thanks for sharing all the quality code.
>
> I just started training my Hindi speaker model and fortunately stumbled upon this very helpful thread.
>
> I'm using a 25 hour single speaker male dataset. It's clear of silences, 22050, 16bit PCM - works well with waveglow, have not tried training Tacotron.
>
> Couple of questions
>
> 1. For hindi speech transcriptions, I'm wondering if it is better to use devanagari letters, or unidecode transliterations using the english character set? i'm currently considering using the character set below. _letters = 'अआइईउऊऋएऐऑओऔकखगघचछजझञटठडढणतथदधनपफबभमयरलवशषसहह़ा'
> 2. If you used the devanagari script (and not english transliterations) a. Could you please share the symbols.py you used? b. would warmstarting using flowtron_ljs.py be futile, since the text would be completely different?

Hi astricks, akshay4malik,

I am trying a Hindi dataset of my own and need help with symbols.py and a Hindi cleaner. Is it possible to share them? Currently I am typing Hindi in Roman English, which is a very tedious process. If possible, please share what is required to use Hindi text directly and what changes are needed for that.

I really appreciate your inputs.

raikarsagar commented 3 years ago

Hi, I used the IITM common label set, which can be used for multiple Indian languages like Hindi, Telugu, Tamil, Gujarati, etc. There is a C-based parser available, but a Python wrapper can be implemented for it.

-sagar


Syed044 commented 3 years ago

Is it possible to share the files? I am a newbie and still struggling to fully understand. Sagar, if you can, or anyone who has used the Hindi language, please share the files you used.

Regards, Sid

Syed044 commented 3 years ago

> Hi, I had used IITM common label set which can be used for multiple indian langs like Hindi, Telugu, Tamil, Gujarati etc. There is a c based parser avl but python wrapper can be implemented for it. -sagar

As I understand it, I should update three files in the text folder: symbols.py, cmudict.py, and cleaners.py.

If you could share the files, it would be a great help. As I understand it, I can use Hindi text with clean audio to train my model. If that's the only thing which needs to be updated, and you have trained a few models of your own, it would be a great help if you shared them.

Regards, Sid