Model training too long without alignment.

brandokoch commented 4 years ago

Hi,I appreciate you great work with flowtron, loved the paper. I have gone through all the issues and the paper but still have some problems with getting a proprietary female voice to have good alignment, I will first list some preassumptions I have so somebody can correct me if something is wrong.(this list can also be useful to help someone starting)

How I have understood the process of training should be done:

Better to use a pretrained model for warmstart, example flowtron_ljs.pt or tacotron2 checkpoint with the config.json include_layers set to ['encoder','embedding']... also dont set checkpoint_path but warmstart_checkpoint_path to the path of the model instead
the first training cycle should be with config.json flow=1 and +1 for each other cycle
for beggining cycles its better to have less speaker and more data and later you can specialize for more speakers and less data
pretrained model dataset dependant layers can be ignored so set train_config.ignore_layers=["speaker_embedding.weight"]
one should train a model until loss plateus/overfits/attention looks good, than stop training take the checkpoint put its path to warmstart_checkpoint_path (or just checkpoint_path?) on it with include_layers set to '' (nothing)
some suggest lowering the learning rate down to 9e-5
be sure to fix a few bugs, the one with the byte to bool and the one where model=warmstart(...) misses the include_layers argument so defaults to None which is bad.
when training set sigma to 1, when inferencing set sigma to 0.5
when inferencing set gate_threshold to 0.5
dont forget to set config.json n_speakers to match with the num of speakers in your dataset
dont forget to format to your dataset the right way... text file should include lines like 'path_to_wav|sentence|speaker_num' and the audio files should be 22050hz mono 16bit

What I am not sure about:

can the second flow be on same data. If so whats the advantage of doing more flows if the first produces good alignment?
in how many iterations can I expect the model to produce good alignment?
if the loss plateus for a long time and the aligment is not good what to do?
is validation loss representative of the aligment? and what validation loss did you achieve?
if you are trying to get good on a single speaker only is it better to use tacotron2 instead, or flowtron adds some stability with more speakers? (Please answer this one)

Now specific to my problem:

I have trained 2 seperate models with difference in preprocessed text but on same data.

Model A

The first model (lets call it A) is trained warm started on your flowtron_ljs.pt with 3 datasets (one of those is ljspeech and 2 are proprietary with about 40000 sentences all combined).I will list the config.json and the run command for the run. Trained up to 1,300,000 iterations for 5 days with 4x1080Ti and it produces no alignment.

config.json: { "train_config": { "output_directory": "outdir", "epochs": 10000000, "learning_rate": 1e-4, "weight_decay": 1e-6, "sigma": 1.0, "iters_per_checkpoint": 5000, "batch_size": 1, "seed": 1234, "checkpoint_path": "", "ignore_layers": [], "include_layers": ["encoder", "embedding"], "warmstart_checkpoint_path": "", "with_tensorboard": true, "fp16_run": false }, "data_config": { "training_files": "data/processed/combined/dataset.train", "validation_files": "data/processed/combined/dataset.test", "text_cleaners": ["flowtron_cleaners"], "p_arpabet": 0.5, "cmudict_path": "data/cmudict_dictionary", "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "max_wav_value": 32768.0 }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" },
"model_config": { "n_speakers": 3, "n_speaker_dim": 128, "n_text": 185, "n_text_dim": 512, "n_flows": 1, "n_mel_channels": 80, "n_attn_channels": 640, "n_hidden": 1024, "n_lstm_layers": 2, "mel_encoder_n_hidden": 512, "n_components": 0, "mean_scale": 0.0, "fixed_gaussian": true, "dummy_speaker_embedding": false, "use_gate_layer": true } }

command: python -m torch.distributed.launch --use_env --nproc_per_node=4 train.py -c config.json -p train_config.output_directory=outdir train_config.ignore_layers=["speaker_embedding.weight"] train_config.warm_checkpoint_path="models/flowtron_ljs.pt" train_config.fp16=true

graphs: (disclamer: all aligments are shown for a single proprietary speaker)

Model B

The second model (lets call it B) is trained warm started on our tacotron2 checkpoint with 3 datasets which were phonemized with my custom preprocessor so I turned off the preprocessing inside flowtron (one of those is ljspeech and 2 are proprietary with about 40000 sentences all combined).I will list the config.json and the run command for the run.Trained up to 1,300,000 iterations for 5 days with 4x1080Ti and it produced slight better aligment but this isnt it.

{ "train_config": { "output_directory": "outdir", "epochs": 10000000, "learning_rate": 1e-4, "weight_decay": 1e-6, "sigma": 1.0, "iters_per_checkpoint": 5000, "batch_size": 1, "seed": 1234, "checkpoint_path": "", "ignore_layers": [], "include_layers": ["encoder", "embedding"], "warmstart_checkpoint_path": "", "with_tensorboard": true, "fp16_run": false }, "data_config": { "training_files": "data/processed/combined/dataset.train", "validation_files": "data/processed/combined/dataset.test", "text_cleaners": ["flowtron_cleaners"], "p_arpabet": 0.5, "cmudict_path": "data/cmudict_dictionary", "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "max_wav_value": 32768.0 }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" },

"model_config": {
    "n_speakers": 3,
    "n_speaker_dim": 128,
    "n_text": 185,
    "n_text_dim": 512,
    "n_flows": 1,
    "n_mel_channels": 80,
    "n_attn_channels": 640,
    "n_hidden": 1024,
    "n_lstm_layers": 2,
    "mel_encoder_n_hidden": 512,
    "n_components": 0,
    "mean_scale": 0.0,
    "fixed_gaussian": true,
    "dummy_speaker_embedding": false,
    "use_gate_layer": true
}

}

command: python -m torch.distributed.launch --use_env --nproc_per_node=4 train.py -c config.json -p train_config.output_directory=outdir train_config.ignore_layers=["speaker_embedding.weight"] train_config.warm_checkpoint_path="models/checkpoint_70000_fav1" train_config.fp16=true

graphs:(disclamer: all aligments are shown for a single proprietary speaker)

Is something wrong because it takes 5 days of training for this? Should I stop this and continue on a checkpoint with flow=2?

rafaelvalle commented 4 years ago

can the second flow be on same data. If so whats the advantage of doing more flows if the first produces good alignment? Yes and they should be trained on the same data. Doing more flows gives the model more capacity and can produce better speech/
in how many iterations can I expect the model to produce good alignment? In less than 24 hours on a NVIDIA V100 with batch size 1.
if the loss plateus for a long time and the aligment is not good what to do? If you're using a single step of flow, use it to warm start a 2-step of flow model
is validation loss representative of the aligment? and what validation loss did you achieve? Not necessarily, look at the alignment itself.
if you are trying to get good on a single speaker only is it better to use tacotron2 instead, or flowtron adds some stability with more speakers? (Please answer this one) Flowtron is better than Tacotron 2.

brandokoch commented 4 years ago

Is this the correct config.json/command for second flow?

{ "train_config": { "output_directory": "outdir", "epochs": 10000000, "learning_rate": 1e-4, "weight_decay": 1e-6, "sigma": 1.0, "iters_per_checkpoint": 5000, "batch_size": 1, "seed": 1234, "checkpoint_path": "", "ignore_layers": [], "include_layers": [], "warmstart_checkpoint_path": "", "with_tensorboard": true, "fp16_run": false }, "data_config": { "training_files": "data/processed/combined/dataset.train", "validation_files": "data/processed/combined/dataset.test", "text_cleaners": ["flowtron_cleaners"], "p_arpabet": 0.5, "cmudict_path": "data/cmudict_dictionary", "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "max_wav_value": 32768.0 }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" },
"model_config": { "n_speakers": 3, "n_speaker_dim": 128, "n_text": 185, "n_text_dim": 512, "n_flows": 2, "n_mel_channels": 80, "n_attn_channels": 640, "n_hidden": 1024, "n_lstm_layers": 2, "mel_encoder_n_hidden": 512, "n_components": 0, "mean_scale": 0.0, "fixed_gaussian": true, "dummy_speaker_embedding": false, "use_gate_layer": true }, }

python -m torch.distributed.launch --use_env --nproc_per_node=4 train.py -c config.json -p train_config.output_directory=outdir train_config.warm_checkpoint_path="/workspace/models/model_1605000.pt" train_config.fp16=true

rafaelvalle commented 4 years ago

Yes, this should work.

brandokoch commented 4 years ago

Here are the results after 560,000 iterations for flow 2, no alignment Screenshot from 2020-07-13 14-21-18

NVIDIA / flowtron