NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
https://nv-adlr.github.io/Flowtron
Apache License 2.0

amount of data for single speaker #117

Open stqc opened 3 years ago

stqc commented 3 years ago

Hi, I am trying to train the model for a single speaker, and I have roughly 10-12 minutes of data. Would this be enough to get decent, passable results? (I plan on using a pretrained model, by the way.)

Should I use flowtron_ljs or flowtron_libritts2p3k (since this is few-shot)?

Also, a request: if at all possible, could you provide a Colab notebook for training?

rafaelvalle commented 3 years ago

Yes, it is possible to get decent results with the amount of data you have. The closer your speaker is to existing speakers in flowtron_libritts2p3k, the better it will sound. Use flowtron_libritts2p3k, change config.json to work with your data, and call python train.py -c config.json -p train_config.finetune_layers=["speaker_embedding.weight"] train_config.checkpoint_path="models/flowtron_libritts2p3k.pt"

You'll need to create a filelist (https://github.com/NVIDIA/flowtron/tree/master/filelists) for your data. You can set your speaker ID to any speaker in LibriTTS.
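
For reference, the filelists in that folder appear to use a pipe-separated audio_path|text|speaker_id layout; a minimal single-speaker sketch, with placeholder paths and transcripts and an arbitrary LibriTTS speaker ID, could look like this:

```
wavs/clip_0001.wav|This is the first recorded sentence.|40
wavs/clip_0002.wav|And here is the second one.|40
```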

stqc commented 3 years ago

Thank you for the reply. I will train the model according to your recommendations and share my results soon!

shehrum commented 3 years ago

Hi @rafaelvalle, I'm trying to fine-tune on a small dataset with the flowtron_libritts2p3k.pt model, but I'm running into this error:

```python
if len(ignore_layers) > 0:
    model_dict = {k: v for k, v in model_dict.items()
                  if k not in ignore_layers}
    dummy_dict = model.state_dict()
    dummy_dict.update(model_dict)
    model_dict = dummy_dict
else:
    optimizer.load_state_dict(checkpoint_dict['optimizer'])
```

File "train.py", line 125, in load_checkpoint optimizer.load_state_dict(checkpoint_dict['optimizer']) KeyError: 'optimizer'

It seems like there is no 'optimizer' key in the saved checkpoint. What's the right way to fix this?

stqc commented 3 years ago

> Hi @rafaelvalle, I'm trying to fine-tune on a small dataset with the flowtron_libritts2p3k.pt model, but I'm running into this error:
>
> File "train.py", line 125, in load_checkpoint
> optimizer.load_state_dict(checkpoint_dict['optimizer'])
> KeyError: 'optimizer'
>
> It seems like there is no 'optimizer' key in the saved checkpoint. What's the right way to fix this?

This happens if you use flowtron_libritts2p3k.pt as config.checkpoint_path. Using the pretrained model to warmstart instead (config.warmstart_checkpoint_path) should solve it.
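
In config.json terms, the difference looks roughly like this (a sketch showing only the two relevant keys, with a placeholder path): as far as I can tell from train.py, checkpoint_path expects a full training checkpoint that also contains the optimizer state, while warmstart_checkpoint_path loads model weights only.

```json
{
  "train_config": {
    "checkpoint_path": "",
    "warmstart_checkpoint_path": "models/flowtron_libritts2p3k.pt"
  }
}
```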

shehrum commented 3 years ago

Great, thanks! @stqc

stqc commented 3 years ago

So after about 102k iterations, the generated audio sounds exactly like the speaker, but the spoken words are not coherent at all, and there is also a weird shape to the attention weights (training with n_flows=1).

[Attached plots: flowtron1, flowtron2, flowtron3]

The following is the config.json:

```json
{
  "train_config": {
    "output_directory": "H:/fs",
    "epochs": 10000000,
    "optim_algo": "RAdam",
    "learning_rate": 1e-4,
    "weight_decay": 1e-6,
    "grad_clip_val": 1,
    "sigma": 1.0,
    "iters_per_checkpoint": 1000,
    "batch_size": 2,
    "seed": 1234,
    "checkpoint_path": "",
    "ignore_layers": [],
    "finetune_layers": [],
    "include_layers": ["encoder", "embedding"],
    "warmstart_checkpoint_path": "",
    "with_tensorboard": true,
    "fp16_run": true
  },
  "data_config": {
    "training_files": "filelists/jennidata1.txt",
    "validation_files": "filelists/val.txt",
    "text_cleaners": ["flowtron_cleaners"],
    "p_arpabet": 0.0,
    "cmudict_path": "data/cmudict_dictionary",
    "sampling_rate": 22050,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0,
    "max_wav_value": 32768.0,
    "use_attn_prior": true,
    "attn_prior_threshold": 1e-4,
    "keep_ambiguous": false
  },
  "dist_config": {
    "dist_backend": "nccl",
    "dist_url": "tcp://localhost:54321"
  },
  "model_config": {
    "n_speakers": 1,
    "n_speaker_dim": 128,
    "n_text": 185,
    "n_text_dim": 512,
    "n_flows": 1,
    "n_mel_channels": 80,
    "n_attn_channels": 640,
    "n_hidden": 1024,
    "n_lstm_layers": 2,
    "mel_encoder_n_hidden": 512,
    "n_components": 0,
    "mean_scale": 0.0,
    "fixed_gaussian": true,
    "dummy_speaker_embedding": false,
    "use_gate_layer": true,
    "use_cumm_attention": false
  }
}
```

shehrum commented 3 years ago

Hey @stqc, do you follow the 2-step training method, i.e. training with the attention prior and then training without it?

I'm quite a newbie to this, and I'm trying to train on about 20 minutes of speaker data.

I have set the config.json like this:

{ "train_config": { "output_directory": "outdir", "epochs": 10000000, "optim_algo": "RAdam", "learning_rate": 1e-3, "weight_decay": 1e-6, "grad_clip_val": 1, "sigma": 1.0, "iters_per_checkpoint": 1000, "batch_size": 8, "seed": 1234, "checkpoint_path": "", "ignore_layers": [], "finetune_layers": [], "include_layers": ["speaker", "encoder", "embedding"], "warmstart_checkpoint_path": "", "with_tensorboard": true, "fp16_run": true }, "data_config": { "training_files": "filelists/pen_train.txt", "validation_files": "filelists/pen_val.txt", "text_cleaners": ["flowtron_cleaners"], "p_arpabet": 0.5, "cmudict_path": "data/cmudict_dictionary", "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "max_wav_value": 32768.0, "use_attn_prior": true, "attn_prior_threshold": 1e-4, "keep_ambiguous": false }, "dist_config": { "dist_backend": "nccl", "dist_url": "tcp://localhost:54321" }, "model_config": { "n_speakers": 2311, "n_speaker_dim": 128, "n_text": 185, "n_text_dim": 512, "n_flows": 2, "n_mel_channels": 80, "n_attn_channels": 640, "n_hidden": 1024, "n_lstm_layers": 2, "mel_encoder_n_hidden": 512, "n_components": 0, "mean_scale": 0.0, "fixed_gaussian": true, "dummy_speaker_embedding": false, "use_gate_layer": true, "use_cumm_attention": false } }

And executing the following command:

python train.py -c config.json -p train_config.finetune_layers=["speaker_embedding.weight"] train_config.warmstart_checkpoint_path="models/flowtron_libritts2p3k.pt"

and for inference:

python inference.py -c config.json -f outdir/model_9000 -w models/waveglow_256channels_universal_v5.pt -t "It is well known that deep generative models have a deep latent space!" -i 40

Note: I have the speaker ID set to 40 in my filelist.

I don't know how to go about the 2-stage training; I just let it train once and then ran inference. However, my results are really bad. I would really appreciate it if you could guide me a bit on training and tell me whether my config.json looks fine. Also, how many steps should I train before stopping? Judging by the validation and training losses it looks fine, as both are going down, but I don't know how to interpret the attention plots.

[Attached attention plots: sid40_sigma0.5_attnlayer0, sid40_sigma0.5_attnlayer1]

[Attached loss curves: train, validation]

brentcty-2020 commented 3 years ago

What are the GPU memory requirements to run this model? It doesn't look like I can run it on an 8 GB 2070 Super even if I reduce the batch size to 1. Are there any other ways to squeeze it into this memory?

Thanks.

deepglugs commented 3 years ago

> What are the GPU memory requirements to run this model? It doesn't look like I can run it on an 8 GB 2070 Super even if I reduce the batch size to 1. Are there any other ways to squeeze it into this memory?
>
> Thanks.

See my answer to #119
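
The details of #119 aren't repeated here, but as a general, hedged starting point: beyond the batch size of 1 already mentioned above, enabling mixed precision and trimming very long utterances from the filelist are the usual remaining knobs. A minimal config fragment along those lines:

```json
{
  "train_config": {
    "batch_size": 1,
    "fp16_run": true
  }
}
```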

rafaelvalle commented 3 years ago

@shehrum You need to first train with the attention prior enabled, then disable it and resume training once the attention looks good.
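
For what it's worth, one way to read that two-phase schedule in terms of the config fields used earlier in this thread (the resume path is a placeholder, and this is a sketch of my understanding rather than an official recipe): phase 1 warmstarts from the pretrained model with the attention prior on, and phase 2 turns the prior off and resumes from the phase-1 checkpoint.

Phase 1:

```json
{
  "train_config": { "warmstart_checkpoint_path": "models/flowtron_libritts2p3k.pt", "checkpoint_path": "" },
  "data_config": { "use_attn_prior": true }
}
```

Phase 2:

```json
{
  "train_config": { "warmstart_checkpoint_path": "", "checkpoint_path": "outdir/model_XXXX" },
  "data_config": { "use_attn_prior": false }
}
```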

raymond00000 commented 2 years ago

@rafaelvalle

Do you mean this setting?

Attention prior enabled:

    "use_attn_prior": true,
    "attn_prior_threshold": 0.0,
    "prior_cache_path": "/attention_prior_cache",

Then disable it:

    "use_attn_prior": false,
    "attn_prior_threshold": 0.0,
    "prior_cache_path": "/attention_prior_cache",

Thanks for the info.

Actually, I tried fine-tuning for few-shot speech synthesis: sid0_sigma0.5_attnlayer0 has a clear attention map, while sid0_sigma0.5_attnlayer1 failed to form a clear one.