Training with preprocessed txt input and mel-spectrogram input

Hi, thank you for the great paper! I've been having problems training a Flowtron model on my own dataset with 8 Tesla V100 GPUs.

Some information about this dataset:

The text inputs are sequences of IDs, each representing a phoneme in a provided dictionary.
The mel-spectrograms are extracted offline with hyper-parameters that differ from the defaults provided in the config.json file in this repo.
The dataset is in English.
The dataset has only one speaker.
The dataset has around 11k sentences in the training set and 130 sentences in the validation set.
The maximum frame length is 300.
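
As a rough sanity check on that last number, the frame limit can be converted to seconds. This is only a sketch: it assumes the `hop_length` of 256 and `sampling_rate` of 24000 from the config attached below are the values actually used for the offline extraction, and the helper name `frames_to_seconds` is made up for illustration.

```python
def frames_to_seconds(n_frames: int, hop_length: int, sampling_rate: int) -> float:
    """Approximate audio duration covered by n_frames spectrogram frames."""
    return n_frames * hop_length / sampling_rate


# Values taken from the attached config; that they match the offline
# mel extraction is an assumption, not something verified here.
max_seconds = frames_to_seconds(300, hop_length=256, sampling_rate=24000)
print(f"{max_seconds:.2f} s")  # prints "3.20 s"
```

Under those assumptions, a limit of 300 frames corresponds to utterances of roughly 3.2 seconds.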

My problem is that the NLL loss starts fluctuating tremendously after reaching a certain value. I've tried different combinations of learning rate and weight decay, but the instability does not improve at all. I'm wondering whether this is normal, as I didn't see a similar situation in the other issues in this repo. The loss quite often spikes to over 10.
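
To make the fluctuation easier to describe, one option is to smooth the logged loss and count how often it exceeds 10. The following is a minimal sketch of that idea, not something taken from the Flowtron code: the function names `ema` and `spike_steps` are hypothetical, and the loss values are synthetic placeholders rather than numbers from my run.

```python
def ema(values, alpha=0.1):
    """Exponential moving average of a sequence of loss values."""
    if not values:
        return []
    smoothed = []
    running = values[0]
    for v in values:
        running = alpha * v + (1 - alpha) * running
        smoothed.append(running)
    return smoothed


def spike_steps(values, threshold=10.0):
    """Indices of the steps whose loss is above the threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


# Synthetic placeholder values, NOT real training logs.
losses = [-0.5, -0.6, 12.3, -0.7, -0.65, 10.8, -0.7]
print(spike_steps(losses))  # prints [2, 5]
print(len(ema(losses)))     # prints 7
```

Comparing the spike indices against the smoothed curve would show whether the underlying trend is still improving or whether the spikes dominate.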

The picture of the loss curve

I will also attach the config that I used for training:

    {
        "train_config": {
            "output_directory": "output_dir",
            "epochs": 10000000,
            "optim_algo": "RAdam",
            "learning_rate": 1e-5,
            "weight_decay": 1e-7,
            "grad_clip_val": 1,
            "sigma": 1.0,
            "iters_per_checkpoint": 1000,
            "batch_size": 32,
            "seed": 1234,
            "checkpoint_path": "",
            "ignore_layers": [],
            "finetune_layers": [],
            "include_layers": ["speaker", "encoder", "embedding"],
            "warmstart_checkpoint_path": "",
            "with_tensorboard": true,
            "fp16_run": true,
            "gate_loss": true,
            "use_ctc_loss": true,
            "ctc_loss_weight": 0.01,
            "blank_logprob": -8,
            "ctc_loss_start_iter": 10000
        },
        "data_config": {
            "train_tdd": "train.tdd",
            "val_tdd": "val.tdd",
            "mf_dirs": ["mf", "mf_2.0"],
            "lf_dirs": ["lf", "lf_2.0"],
            "speaker_format": "label",
            "speaker_dir": "",
            "speaker_stream": "",
            "speaker_regex": ["laura"],
            "text_cleaners": ["flowtron_cleaners"],
            "randomize": false,
            "p_arpabet": 0.5,
            "cmudict_path": "data/cmudict_dictionary",
            "sampling_rate": 24000,
            "filter_length": 1024,
            "hop_length": 256,
            "win_length": 1024,
            "mel_fmin": 0.0,
            "mel_fmax": 8000.0,
            "max_wav_value": 32768.0,
            "use_attn_prior": true,
            "attn_prior_threshold": 0.0,
            "prior_cache_path": "/attention_prior_cache",
            "betab_scaling_factor": 1.0,
            "keep_ambiguous": false,
            "max_frame_length": 300
        },
        "dist_config": {
            "dist_backend": "nccl",
            "dist_url": "tcp://localhost:54321"
        },
        "model_config": {
            "n_speakers": 1,
            "n_speaker_dim": 128,
            "n_text": 84,
            "n_text_dim": 512,
            "n_flows": 1,
            "n_mel_channels": 80,
            "n_attn_channels": 640,
            "n_hidden": 1024,
            "n_lstm_layers": 2,
            "mel_encoder_n_hidden": 512,
            "n_components": 0,
            "mean_scale": 0.0,
            "fixed_gaussian": true,
            "dummy_speaker_embedding": false,
            "use_gate_layer": true,
            "use_cumm_attention": false
        }
    }

Any insights would be appreciated!

NVIDIA / flowtron

Training with preprocessed txt input and mel-spectrogram input #133