Hi,
Thank you for the great paper!
I've been having problems training a Flowtron model with my own dataset on 8 Tesla V100 GPUs.
Some information about this dataset:
The text inputs are sequences of ids, each representing a phoneme in a provided dictionary.
The mel-spectrograms are extracted offline with hyper-parameters different from the defaults provided in the config.json file in this repo.
The dataset is in English.
The dataset has only one speaker.
The dataset has around 11k sentences in the training set and 130 sentences in the validation set.
The maximum frame length is 300.
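To sanity-check the max frame length of 300, here is a minimal sketch of how clip duration maps to mel frame count, assuming the hop_length 256 and sampling_rate 24000 from the attached config and a center-padded STFT (the exact frame formula may differ slightly in Flowtron's own STFT code):

```python
# Hypothetical helper, not part of the Flowtron codebase.
def mel_frames(n_samples, hop_length=256):
    # A center-padded STFT yields floor(n_samples / hop_length) + 1 frames.
    return n_samples // hop_length + 1

def max_clip_seconds(max_frames=300, hop_length=256, sampling_rate=24000):
    # Longest clip (in seconds) that still fits under max_frame_length.
    return (max_frames - 1) * hop_length / sampling_rate
```

So with these settings, max_frame_length=300 corresponds to clips of roughly 3.2 seconds; anything longer would be truncated or filtered depending on the loader.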
My problem is that the NLL loss starts oscillating tremendously after reaching a certain value. I've tried different combinations of learning rate and weight decay, but the shaky loss does not improve at all. I'm wondering if this is normal, as I didn't see a similar situation reported in the issues in this repo. The loss quite often spikes above 10.
I would say that you have a bad training example: the text may not match the clip exactly. I found that my graphs would look choppy like this when the data was bad. As soon as I cleaned up the errors, it went away.
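One cheap way to hunt for such mismatched pairs is to flag clips whose seconds-per-phoneme ratio is a statistical outlier. This is only a sketch: the input format (a list of `(text_len, n_frames)` tuples) and the function name are assumptions, not Flowtron's actual loader API, and the hop_length/sampling_rate defaults come from the attached config:

```python
import statistics

def find_suspect_examples(samples, hop_length=256, sampling_rate=24000, z_thresh=3.0):
    """samples: list of (text_len, n_frames) tuples; returns indices of outliers.

    A clip whose audio is far too long or too short for its phoneme count
    (|z-score| above z_thresh) is likely a text/audio mismatch.
    """
    secs_per_token = [
        (n_frames * hop_length / sampling_rate) / max(text_len, 1)
        for text_len, n_frames in samples
    ]
    mu = statistics.mean(secs_per_token)
    sd = statistics.stdev(secs_per_token)
    return [
        i for i, r in enumerate(secs_per_token)
        if sd > 0 and abs(r - mu) / sd > z_thresh
    ]
```

Listening to the flagged clips (or checking their transcripts) is usually enough to confirm whether the pairing is wrong.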
Here is a picture of the loss curve:
I will also attach the config that I used for training:
```json
{
    "train_config": {
        "output_directory": "output_dir",
        "epochs": 10000000,
        "optim_algo": "RAdam",
        "learning_rate": 1e-5,
        "weight_decay": 1e-7,
        "grad_clip_val": 1,
        "sigma": 1.0,
        "iters_per_checkpoint": 1000,
        "batch_size": 32,
        "seed": 1234,
        "checkpoint_path": "",
        "ignore_layers": [],
        "finetune_layers": [],
        "include_layers": ["speaker", "encoder", "embedding"],
        "warmstart_checkpoint_path": "",
        "with_tensorboard": true,
        "fp16_run": true,
        "gate_loss": true,
        "use_ctc_loss": true,
        "ctc_loss_weight": 0.01,
        "blank_logprob": -8,
        "ctc_loss_start_iter": 10000
    },
    "data_config": {
        "train_tdd": "train.tdd",
        "val_tdd": "val.tdd",
        "mf_dirs": ["mf", "mf_2.0"],
        "lf_dirs": ["lf", "lf_2.0"],
        "speaker_format": "label",
        "speaker_dir": "",
        "speaker_stream": "",
        "speaker_regex": ["laura"],
        "text_cleaners": ["flowtron_cleaners"],
        "randomize": false,
        "p_arpabet": 0.5,
        "cmudict_path": "data/cmudict_dictionary",
        "sampling_rate": 24000,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "max_wav_value": 32768.0,
        "use_attn_prior": true,
        "attn_prior_threshold": 0.0,
        "prior_cache_path": "/attention_prior_cache",
        "betab_scaling_factor": 1.0,
        "keep_ambiguous": false,
        "max_frame_length": 300
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },
    "model_config": {
        "n_speakers": 1,
        "n_speaker_dim": 128,
        "n_text": 84,
        "n_text_dim": 512,
        "n_flows": 1,
        "n_mel_channels": 80,
        "n_attn_channels": 640,
        "n_hidden": 1024,
        "n_lstm_layers": 2,
        "mel_encoder_n_hidden": 512,
        "n_components": 0,
        "mean_scale": 0.0,
        "fixed_gaussian": true,
        "dummy_speaker_embedding": false,
        "use_gate_layer": true,
        "use_cumm_attention": false
    }
}
```
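Since the mels were extracted offline with hyper-parameters different from the config defaults, one thing worth ruling out is a silent mismatch between the extraction settings and what the config declares. This is a hypothetical cross-check, not something Flowtron reads; the `offline_params` dict stands in for a record of your own extraction settings:

```python
# Hypothetical cross-check between the training config and the
# settings actually used for offline mel extraction.
STFT_KEYS = ["sampling_rate", "filter_length", "hop_length",
             "win_length", "mel_fmin", "mel_fmax", "n_mel_channels"]

def check_mel_params(config, offline_params):
    """Return {key: (config_value, offline_value)} for every mismatch."""
    merged = {**config.get("data_config", {}), **config.get("model_config", {})}
    return {
        k: (merged.get(k), offline_params[k])
        for k in STFT_KEYS
        if k in offline_params and merged.get(k) != offline_params[k]
    }
```

A mismatch in hop_length or mel_fmax, for example, would make the model see spectrograms on a different scale than the config assumes, which can destabilize the NLL.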
Any insights would be appreciated!