NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Strange attention_weights matrix #60

Open tshmak opened 4 years ago

tshmak commented 4 years ago

I'm training the flowtron model from scratch on the LJSpeech dataset. It seems to run ok. However, after nearly three days, the attention matrix still has the following form and the resulting generated speech resembles speech from a foreign language. [attention weights plot attached]

Can anyone provide me with some intuition on what may be going wrong?

Thanks,

Tim

stephenmelsom commented 4 years ago

Hey Tim,

If the model does not learn to attend, it will produce waves that sound like a foreign language (or really just garbled English). It sounds like your config.json file may not be set up correctly to train from scratch. Several config changes are needed to accomplish this. Check out #39 where @rafaelvalle describes how you can train from scratch or warmstart models.

rafaelvalle commented 3 years ago

setting the attention prior to true will certainly help the model to learn attention. https://github.com/NVIDIA/flowtron/blob/master/config.json#L34

please take a look at the readme
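
for reference, the attention prior switch lives under data_config; a minimal sketch of the relevant fields (field names as they appear in the config posted later in this thread, values shown only as illustration, so double-check the current repo defaults):

    "data_config": {
        "use_attn_prior": true,
        "attn_prior_threshold": 0.0,
        "prior_cache_path": "/attention_prior_cache",
        "betab_scaling_factor": 1.0
    }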

andi-808 commented 3 years ago

I have the following attention plots and graphs.

Would we consider this good enough to turn attention prior off and continue training, adjusting learning rate as the curves plateau?

[attention plots and training curves attached]

rafaelvalle commented 3 years ago

these look quite good aside from the unexpected curve around 500 frames. you should be able to turn the attention prior off and adjust the learning rate, etc. was this model trained with the ctc loss?
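
concretely, resuming with the prior disabled and a smaller learning rate would look roughly like this (a sketch, not a prescription; the exact learning rate is up to you):

    "train_config": {
        "learning_rate": 1e-5
    },
    "data_config": {
        "use_attn_prior": false
    }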

andi-808 commented 3 years ago

Ok, I'll give it a try with the attention prior off. Would you increase the batch size as well?

CTC loss is set to "true", with ctc_loss_weight = 0.1.

Do you have a recommendation on this?

rafaelvalle commented 3 years ago

increasing batch size shouldn't cause any issues. we're very glad to see your extremely sharp attentions. it's a consequence of the ctc loss we recently added.
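
for reference, the ctc-related knobs sit in train_config; the values below are the ones from the config posted later in this thread, shown only as a sketch rather than a recommendation:

    "use_ctc_loss": true,
    "ctc_loss_weight": 0.1,
    "blank_logprob": -8,
    "ctc_loss_start_iter": 1000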

andi-808 commented 3 years ago

I thought the attentions looked good, much sharper and brighter than in other posts I've seen. Despite this, though, the speech results at this point still sound bad. Is this normal?

I originally trained on the LJSpeech set from scratch and was getting good speech by this point. My dataset is much larger than the LJSpeech one, but I think the transcriptions might not be 100% good. I think the probability of success comes down to the quality of the dataset; hopefully my model gets there with a few hundred thousand more iterations.

rafaelvalle commented 3 years ago

you need to resume training without the attention prior so that you can perform inference. in fact, you possibly could've resumed training without the prior at an earlier iteration, e.g. 200k.
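
to be clear, that means resuming via checkpoint_path rather than warmstart_checkpoint_path, with the prior disabled; a sketch, where the path is a placeholder and the filename assumes the model_<iteration> naming in the output directory (check your outdir for the exact name):

    "train_config": {
        "checkpoint_path": "/path/to/outdir/model_200000",
        "warmstart_checkpoint_path": ""
    },
    "data_config": {
        "use_attn_prior": false
    }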

andi-808 commented 3 years ago

The attention plots looked like this throughout the training. [attention plot attached]

I didn't know you had to continue training with the attention prior turned off for it to produce good speech! It's interesting, though: the speech quality did improve over the course of training with the attention prior turned on.

rafaelvalle commented 3 years ago

did you use the defaults in https://github.com/NVIDIA/flowtron/blob/master/config.json?

andi-808 commented 3 years ago

Mostly those defaults. The learning rate was 10e-4, and I changed include_layers since I was doing a warm start from your pretrained model.

rohanbadlani commented 3 years ago

A lower learning rate or a lower CTC loss weight should work better. Since you are warmstarting from a pretrained checkpoint, a lower lr like 1e-4 works fine (with a ctc_loss_weight of either 0.1 or 0.01). The parameters in the default config work fine for training from scratch.
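
In config terms, a warmstart along those lines would look roughly like the following (a sketch; the checkpoint path is a placeholder for whichever pretrained model you downloaded, and include_layers matches the config posted above):

    "train_config": {
        "learning_rate": 1e-4,
        "ctc_loss_weight": 0.01,
        "warmstart_checkpoint_path": "/path/to/pretrained_flowtron.pt",
        "include_layers": ["encoder", "embedding"]
    }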

andi-808 commented 3 years ago

Now that I have continued training with the attention prior turned off, my attention plots are starting to look off. [attention plot screenshots attached]

config looks like this:

{
    "train_config": {
        "output_directory": "/media/andy/Untitled/flowtron3/outdir",
        "epochs": 10000000,
        "optim_algo": "RAdam",
        "learning_rate": 1e-5,
        "weight_decay": 1e-6,
        "grad_clip_val": 1,
        "sigma": 1.0,
        "iters_per_checkpoint": 1000,
        "batch_size": 8,
        "seed": 1234,
        "checkpoint_path": "",
        "ignore_layers": [],
        "finetune_layers": [],
        "include_layers": ["encoder", "embedding"],
        "warmstart_checkpoint_path": "",
        "with_tensorboard": true,
        "fp16_run": true,
        "gate_loss": true,
        "use_ctc_loss": true,
        "ctc_loss_weight": 0.1,
        "blank_logprob": -8,
        "ctc_loss_start_iter": 1000
    },
    "data_config": {
        "training_files": "filelists/ljs_audiopaths_text_sid_train_filelist.txt",
        "validation_files": "filelists/ljs_audiopaths_text_sid_val_filelist.txt",
        "text_cleaners": ["flowtron_cleaners"],
        "p_arpabet": 0.5,
        "cmudict_path": "data/cmudict_dictionary",
        "sampling_rate": 22050,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "max_wav_value": 32768.0,
        "use_attn_prior": true,
        "attn_prior_threshold": 0.0,
        "prior_cache_path": "/attention_prior_cache",
        "betab_scaling_factor": 1.0,
        "keep_ambiguous": false
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },
    "model_config": {
        "n_speakers": 1,
        "n_speaker_dim": 128,
        "n_text": 185,
        "n_text_dim": 512,
        "n_flows": 2,
        "n_mel_channels": 80,
        "n_attn_channels": 640,
        "n_hidden": 1024,
        "n_lstm_layers": 2,
        "mel_encoder_n_hidden": 512,
        "n_components": 0,
        "mean_scale": 0.0,
        "fixed_gaussian": true,
        "dummy_speaker_embedding": false,
        "use_gate_layer": true,
        "use_cumm_attention": false
    }
}

rohanbadlani commented 3 years ago

If you warmstarted with a very high lr, like the lr=10e-3 you mentioned, then the attention module can break, and even removing the prior will not help. Warmstarting from pretrained checkpoints with a reasonable lr of 1e-4 should work fine.

andi-808 commented 3 years ago

When I started the warm start, the learning rate was 1e-4 and the number of flows was 1; I then changed n_flows to 2 when the attention looked good. I kept the learning rate the same until the attention looked good with n_flows = 2.

I then turned the attention prior off, set the learning rate to 1e-5, and set the batch size to 8. As you can see from the graphs above, the attention is starting to look off as the decoder time step increases.

andi-808 commented 3 years ago

I am feeling rather stuck now. I've trained to 765k iterations and I am still getting nonsense sentences during inference. The plots look OK after inference, although they do differ when you type different phrases in: [inference attention plots attached]

As you can see, I have reduced the learning rate as the validation loss plateaus, and the attention plots still look good. Can anyone make any suggestions? Does adding more steps of flow (greater than 2, that is) actually help?

[training and validation loss curves attached]

rohanbadlani commented 3 years ago

Hmmm, I think I see the problem: if you look at the gate loss, there is a spike around 575k (which I'm guessing is the point when you removed the prior? If yes, it's expected that the gate loss increases when you remove the prior). I think the problem is that the gate isn't working properly. If the gate isn't working, the first step of flow during inference won't produce a proper output, so the next step of flow will receive inaccurate inputs and the whole output will collapse. The reason I think the gate isn't working is the high validation gate loss (an overfitted gate). What happens when you choose the model at 600k, where the gate loss hasn't overfitted too badly?
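
To try that, you could point checkpoint_path at the ~600k checkpoint in your output directory and run inference from it (or resume from there). A sketch, assuming checkpoints are saved as model_<iteration> under output_directory, so check your outdir for the exact filename:

    "train_config": {
        "output_directory": "/media/andy/Untitled/flowtron3/outdir",
        "checkpoint_path": "/media/andy/Untitled/flowtron3/outdir/model_600000"
    }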