NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0
887 stars, 177 forks

Training with preprocessed txt input and mel-spectrogram input #133

Open youuuw opened 2 years ago

youuuw commented 2 years ago

Hi, thank you for the great paper! I've been having problems training a Flowtron model on my own dataset with 8 Tesla V100 GPUs.

Some information about this dataset:

  1. The text inputs are sequences of IDs, each representing a phoneme in a provided dictionary.
  2. The mel-spectrograms are extracted offline with hyper-parameters that differ from the defaults provided in the config.json file in this repo.
  3. The dataset is in English.
  4. The dataset has only one speaker.
  5. The dataset has around 11k sentences in training set and 130 sentences in validation set.
  6. The maximum frame length is 300.

My problem is that the NLL loss starts oscillating heavily after reaching a certain value. I've tried different combinations of learning rate and weight decay, but the instability has not improved at all. I'm wondering whether this is normal, since I haven't seen a similar situation in the other issues in this repo. The loss quite often spikes above 10.

The loss curve: [image: Screenshot 2021-10-02 at 03 08 58]

I will also attach the config that I used to train:

    {
      "train_config": {
        "output_directory": "output_dir",
        "epochs": 10000000,
        "optim_algo": "RAdam",
        "learning_rate": 1e-5,
        "weight_decay": 1e-7,
        "grad_clip_val": 1,
        "sigma": 1.0,
        "iters_per_checkpoint": 1000,
        "batch_size": 32,
        "seed": 1234,
        "checkpoint_path": "",
        "ignore_layers": [],
        "finetune_layers": [],
        "include_layers": ["speaker", "encoder", "embedding"],
        "warmstart_checkpoint_path": "",
        "with_tensorboard": true,
        "fp16_run": true,
        "gate_loss": true,
        "use_ctc_loss": true,
        "ctc_loss_weight": 0.01,
        "blank_logprob": -8,
        "ctc_loss_start_iter": 10000
      },
      "data_config": {
        "train_tdd": "train.tdd",
        "val_tdd": "val.tdd",
        "mf_dirs": ["mf", "mf_2.0"],
        "lf_dirs": ["lf", "lf_2.0"],
        "speaker_format": "label",
        "speaker_dir": "",
        "speaker_stream": "",
        "speaker_regex": ["laura"],
        "text_cleaners": ["flowtron_cleaners"],
        "randomize": false,
        "p_arpabet": 0.5,
        "cmudict_path": "data/cmudict_dictionary",
        "sampling_rate": 24000,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "max_wav_value": 32768.0,
        "use_attn_prior": true,
        "attn_prior_threshold": 0.0,
        "prior_cache_path": "/attention_prior_cache",
        "betab_scaling_factor": 1.0,
        "keep_ambiguous": false,
        "max_frame_length": 300
      },
      "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
      },
      "model_config": {
        "n_speakers": 1,
        "n_speaker_dim": 128,
        "n_text": 84,
        "n_text_dim": 512,
        "n_flows": 1,
        "n_mel_channels": 80,
        "n_attn_channels": 640,
        "n_hidden": 1024,
        "n_lstm_layers": 2,
        "mel_encoder_n_hidden": 512,
        "n_components": 0,
        "mean_scale": 0.0,
        "fixed_gaussian": true,
        "dummy_speaker_embedding": false,
        "use_gate_layer": true,
        "use_cumm_attention": false
      }
    }

Any insights would be appreciated!
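As a quick sanity check on the numbers in this config, the sketch below converts `max_frame_length` into a clip duration. It assumes that `hop_length` and `sampling_rate` from the config describe the offline-extracted mel-spectrograms, which the post does not confirm; the helper name `max_clip_seconds` is hypothetical and not part of Flowtron.

```python
def max_clip_seconds(max_frame_length: int, hop_length: int, sampling_rate: int) -> float:
    """Duration in seconds covered by `max_frame_length` mel frames.

    Assumption: each frame advances the signal by `hop_length` samples.
    """
    return max_frame_length * hop_length / sampling_rate


# Values taken from the config above (assumed to match the offline features).
seconds = max_clip_seconds(max_frame_length=300, hop_length=256, sampling_rate=24000)
print(f"{seconds:.2f} s")
```

If the result looks short compared with typical sentence lengths in the dataset, it may be worth checking how utterances longer than the limit are handled by the data loader.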

andi-808 commented 2 years ago

I would say that you have a bad training example: the text may not match the audio clip exactly. I found that my graphs looked choppy like this when the data was bad. As soon as I cleaned up the errors, it went away.
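One possible way to hunt for such mismatched pairs, sketched here as an illustration rather than as the method used in this thread: compare the number of mel frames per text token for each utterance and flag those far from the dataset median. The function name `flag_suspect_pairs`, the `(utt_id, n_tokens, n_frames)` input layout, and the threshold `k` are all assumptions made for this example.

```python
from statistics import median


def flag_suspect_pairs(items, k=5.0):
    """Return ids whose frames-per-token ratio is far from the median.

    items: iterable of (utt_id, n_tokens, n_frames) tuples (hypothetical layout).
    k: how many median absolute deviations (MAD) count as "far" (assumed value).
    """
    # Frames of audio per text token; skip empty transcripts to avoid dividing by zero.
    ratios = {
        utt_id: n_frames / n_tokens
        for utt_id, n_tokens, n_frames in items
        if n_tokens > 0
    }
    if not ratios:
        return []

    values = list(ratios.values())
    center = median(values)
    # MAD is a robust spread estimate that a few bad clips cannot inflate much.
    mad = median([abs(v - center) for v in values])
    scale = mad if mad > 0 else 1e-8

    return sorted(
        utt_id for utt_id, r in ratios.items() if abs(r - center) / scale > k
    )


if __name__ == "__main__":
    # Toy data: utterance "e" has far more frames than its text would suggest.
    toy = [("a", 10, 80), ("b", 12, 96), ("c", 20, 170), ("d", 15, 118), ("e", 10, 300)]
    print(flag_suspect_pairs(toy))
```

Flagged utterances are only candidates for manual review; a long pause or slow speech can also produce an unusual ratio without the transcript being wrong.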