tshmak opened this issue 4 years ago
Hey Tim,
If the model does not learn to attend, it will produce waves that sound like a foreign language (or really just garbled up English). It sounds like your config.json file may not be setup correctly to train from scratch. Several changes need to be made to the config so that you can accomplish this. Check out #39 where @rafaelvalle describes how you can train from scratch or warmstart models.
setting the attention prior to true will certainly help the model to learn attention. https://github.com/NVIDIA/flowtron/blob/master/config.json#L34
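for reference, the prior is controlled by the data_config block; a minimal sketch of the relevant entries (field names come from the repo config, the values here are just illustrative):

{
  "data_config": {
    "use_attn_prior": true,
    "attn_prior_threshold": 0.0,
    "prior_cache_path": "/attention_prior_cache"
  }
}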
please take a look at the readme
I have the following attention plots and graphs.
Would we consider this good enough to turn attention prior off and continue training, adjusting learning rate as the curves plateau?
these look quite good aside from the unexpected curve around 500 frames. you should be able to turn the attention prior off and adjust learning rate, etc. was this model trained with the ctc loss?
Ok I’ll give it a try with attention prior off. Would you increase the batch size as well?
CTC loss is set to “true”, ctc loss weight = 0.1.
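i.e. the relevant train_config entries in my config (trimmed to just the ctc settings):

{
  "train_config": {
    "use_ctc_loss": true,
    "ctc_loss_weight": 0.1,
    "blank_logprob": -8,
    "ctc_loss_start_iter": 1000
  }
}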
do you have a recommendation on this?
increasing batch size shouldn't cause any issues. we're very glad to see your extremely sharp attentions. it's a consequence of the ctc loss we recently added.
I thought the attentions looked good, much sharper and brighter than other posts I’ve seen. Despite this though, the speech results at this point still sound bad. Is this quite normal at this point?
I originally trained on the LS dataset from scratch and was getting good speech by this point. My dataset is much larger than the LS one, but I think the transcriptions might not be 100% good. I think the probability of success comes down to the quality of the dataset; hopefully my model gets there with a few hundred thousand more iterations.
you need to resume training without the attention prior such that you can perform inference. in fact, you possibly could've resumed training without the prior at an earlier iteration, e.g. 200k.
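concretely, that just means pointing checkpoint_path at the checkpoint you want to resume from and disabling the prior, roughly like this (the checkpoint path below is only a placeholder, use your own):

{
  "train_config": {
    "checkpoint_path": "outdir/model_200000"
  },
  "data_config": {
    "use_attn_prior": false
  }
}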
The attention plots looked like this throughout the training.
I didn’t know you had to continue training with the attention prior turned off for it to produce good speech! It’s interesting though, the speech quality did improve through the training with the attention prior turned on.
did you use the defaults on https://github.com/NVIDIA/flowtron/blob/master/config.json ?
Mostly those defaults. Learning rate was 10e-4, and I changed include_layers as I was doing a warm start from your pretrained model.
A lower learning rate or lower ctc loss weight should work better. Since it is warmstarting from pretrained checkpoint, a lower lr like 1e-4 works fine (with both ctc loss weight 0.1 or 0.01). The parameters in default config work fine for training from scratch.
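In config terms that warmstart setup would look roughly like the following (the checkpoint path is a placeholder for wherever you keep the pretrained model, include_layers is just illustrative, and ctc_loss_weight could also be 0.01 as noted above):

{
  "train_config": {
    "learning_rate": 1e-4,
    "ctc_loss_weight": 0.1,
    "warmstart_checkpoint_path": "models/flowtron_ljs.pt",
    "include_layers": ["encoder", "embedding"]
  }
}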
Now that I have continued training with attention prior = 0, my attention plots are starting to look off.
config looks like this:
{ "train_config": { "output_directory": "/media/andy/Untitled/flowtron3/outdir", "epochs": 10000000, "optim_algo": "RAdam", "learning_rate": 1e-5, "weight_decay": 1e-6, "grad_clip_val": 1, "sigma": 1.0, "iters_per_checkpoint": 1000, "batch_size": 8, "seed": 1234, "checkpoint_path": "", "ignore_layers": [], "finetune_layers": [], "include_layers": ["encoder", "embedding"], "warmstart_checkpoint_path": "", "with_tensorboard": true, "fp16_run": true, "gate_loss": true, "use_ctc_loss": true, "ctc_loss_weight": 0.1, "blank_logprob": -8, "ctc_loss_start_iter": 1000 }, "data_config": { "training_files": "filelists/ljs_audiopaths_text_sid_train_filelist.txt", "validation_files": "filelists/ljs_audiopaths_text_sid_val_filelist.txt", "text_cleaners": ["flowtron_cleaners"], "p_arpabet": 0.5, "cmudict_path": "data/cmudict_dictionary", "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "max_wav_value": 32768.0, "use_attn_prior": true, "attn_prior_threshold": 0.0, "prior_cache_path": "/attention_prior_cache", "betab_scaling_factor": 1.0, "keep_ambiguous": false }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" }, "model_config": { "n_speakers": 1, "n_speaker_dim": 128, "n_text": 185, "n_text_dim": 512, "n_flows": 2, "n_mel_channels": 80, "n_attn_channels": 640, "n_hidden": 1024, "n_lstm_layers": 2, "mel_encoder_n_hidden": 512, "n_components": 0, "mean_scale": 0.0, "fixed_gaussian": true, "dummy_speaker_embedding": false, "use_gate_layer": true, "use_cumm_attention": false } }
If you warmstarted with a very high lr like the lr=10e-3 you mentioned, then the attention module can break, and even removing the prior will not help. Warmstarting from the pretrained checkpoints with a reasonable lr of 1e-4 should work fine.
When I started with the warm start, the learning rate was 1e-4 and the number of flows was 1; I then changed it to 2 when the attention looked good. I kept the learning rate the same until the attention looked good with number of flows = 2.
I then turned the attention prior off, set the learning rate to 1e-5, and set the batch size to 8. Then you can see from the graphs above that attention is starting to look off as the decoder time step increases.
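In config terms, the changes I described above were roughly these (a sketch of what I changed rather than an exact diff):

{
  "train_config": {
    "learning_rate": 1e-5,
    "batch_size": 8
  },
  "model_config": {
    "n_flows": 2
  },
  "data_config": {
    "use_attn_prior": false
  }
}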
I am feeling rather stuck now. I've trained to 765k iterations and I am still getting nonsense sentences during inference. The plots look OK after inference, although they do differ when you type different phrases in:
As you can see, I have reduced the learning rate as the validation loss plateaus, and the attention plots still look good. Can anyone make any suggestions? Does adding more steps of flow actually help (greater than 2, that is)?
hmmmm... I think I see the problem: if you look at the gate loss, there is a spike around 575k (which I'm guessing is the point when you removed the prior? If yes, it's expected that the gate loss will increase when you remove the prior). I think the problem is that the gate isn't working properly; if the gate isn't working properly, then the first step of flow during inference won't have a proper output, the next step of flow will receive inaccurate inputs, and the whole output will collapse. The reason I think the gate isn't working fine is the high validation gate loss (i.e. an overfitted gate). What happens when you choose the model at 600k (where the gate loss hasn't overfitted too badly)?
I'm training the flowtron model from scratch on the LJSpeech dataset. It seems to run ok. However, after nearly three days, the attention matrix still has the following form and the resulting generated speech resembles speech from a foreign language.
Can anyone provide me with some intuition on what may be going wrong?
Thanks,
Tim