NVIDIA / flowtron

Flowtron is an auto-regressive flow-based generative network for text to speech synthesis with control over speech variation and style transfer
https://nv-adlr.github.io/Flowtron
Apache License 2.0

Published Flowtron LibriTTS2K model does not include iteration or optimizer #123

Open ttt733 opened 3 years ago

ttt733 commented 3 years ago

Unless I'm missing something, the fine-tuning instructions in the readme do not work. In train.py:

iteration = checkpoint_dict['iteration']
...
    if len(ignore_layers) > 0:
        ...
    else:
        optimizer.load_state_dict(checkpoint_dict['optimizer'])

Hacking around the missing iteration value with iteration = 1 has been mentioned in previous issues, and the optimizer can be skipped over by putting a dummy value into ignore_layers, but it seems like making the published model fit the code would be ideal.
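As a rough illustration of a more tolerant alternative to the two workarounds above, the checkpoint keys could be read with defaults instead of direct indexing. This is only a sketch, not the repository's code: the helper name `read_resume_state` and the example dictionaries are hypothetical, and the fallback of 1 simply mirrors the `iteration = 1` hack mentioned in earlier issues.

```python
def read_resume_state(checkpoint_dict, default_iteration=1):
    """Return (iteration, optimizer_state) from a checkpoint dict.

    Missing keys fall back to defaults instead of raising KeyError.
    """
    iteration = checkpoint_dict.get("iteration", default_iteration)
    optimizer_state = checkpoint_dict.get("optimizer")  # None when absent
    return iteration, optimizer_state


# Hypothetical weights-only checkpoint, standing in for a published model.
published = {"state_dict": {}}
iteration, optimizer_state = read_resume_state(published)
print(iteration, optimizer_state is None)

# Hypothetical checkpoint written during training, carrying both keys.
training = {"state_dict": {}, "iteration": 5000, "optimizer": {"state": {}}}
print(read_resume_state(training))
```

With a helper like this, the training script would call `optimizer.load_state_dict(...)` only when the returned optimizer state is not `None`, so a weights-only checkpoint would need no dummy entry in `ignore_layers`.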

ttt733 commented 3 years ago

I cannot get the LibriTTS2K model to work for inference either. I do not think the model is loading the weights properly, as it seems to generate only random noise. If you see anything I'm doing wrong, let me know. If I can get it working, I'll put in a PR to update the readme instructions.

config.json:

{
    "train_config": {
        "output_directory": "/outdir",
        "epochs": 10000000,
        "optim_algo": "RAdam",
        "learning_rate": 1e-3,
        "weight_decay": 1e-6,
        "grad_clip_val": 1,
        "sigma": 1.0,
        "iters_per_checkpoint": 1000,
        "batch_size": 1,
        "seed": 1234,
        "checkpoint_path": "models/flowtron_libritts2p3k.pt",
        "ignore_layers": [],
        "finetune_layers": [],
        "include_layers": ["speaker", "encoder", "embedding"],
        "warmstart_checkpoint_path": "",
        "with_tensorboard": true,
        "fp16_run": true,
        "gate_loss": true,
        "use_ctc_loss": true,
        "ctc_loss_weight": 0.01,
        "blank_logprob": -8,
        "ctc_loss_start_iter": 10000
    },
    "data_config": {
        "training_files": "filelists/libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt",
        "validation_files": "filelists/libritts_train_clean_100_audiopath_text_sid_atleast5min_val_filelist.txt",
        "text_cleaners": ["flowtron_cleaners"],
        "p_arpabet": 0.5,
        "cmudict_path": "data/cmudict_dictionary",
        "sampling_rate": 22050,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0,
        "max_wav_value": 32768.0,
        "use_attn_prior": true,
        "attn_prior_threshold": 0.0,
        "prior_cache_path": "/attention_prior_cache",
        "betab_scaling_factor": 1.0,
        "keep_ambiguous": false
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },
    "model_config": {
        "n_speakers": 123,
        "n_speaker_dim": 128,
        "n_text": 185,
        "n_text_dim": 512,
        "n_flows": 2,
        "n_mel_channels": 80,
        "n_attn_channels": 640,
        "n_hidden": 1024,
        "n_lstm_layers": 2,
        "mel_encoder_n_hidden": 512,
        "n_components": 0,
        "mean_scale": 0.0,
        "fixed_gaussian": true,
        "dummy_speaker_embedding": false,
        "use_gate_layer": true,
        "use_cumm_attention": false
    }
}

Command:

python inference.py -o ./outdir -c config.json -f models/flowtron_libritts2p3k.pt -w models/waveglow_256channels_universal_v5.pt -t "It is well known that deep generative models have a rich latent space!" -i 1088

Output:

sid1088_sigma0 5_attnlayer1
sid1088_sigma0 5_attnlayer0

Plus a 410 kb wav file of static. The waveglow model (v5) is the one linked in that repo's readme. And since it was mentioned in #74, my torch version is torch==1.8.1+cu111, though I wasn't sure what exactly was meant by "try inference in fp32."
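One way to test the suspicion that the weights are not being loaded is to compare the parameter names and shapes the model expects against those stored in the checkpoint. The following is a minimal sketch under stated assumptions: the helper `compare_state_shapes` is hypothetical, and the parameter names and shapes in the example are made up for illustration rather than taken from Flowtron.

```python
def compare_state_shapes(model_shapes, checkpoint_shapes):
    """Compare two {parameter_name: shape} mappings.

    Returns (missing, unexpected, mismatched):
      missing    - names the model expects but the checkpoint lacks
      unexpected - names in the checkpoint that the model does not have
      mismatched - names present in both but with different shapes
    """
    missing = sorted(set(model_shapes) - set(checkpoint_shapes))
    unexpected = sorted(set(checkpoint_shapes) - set(model_shapes))
    mismatched = sorted(
        name
        for name in set(model_shapes) & set(checkpoint_shapes)
        if tuple(model_shapes[name]) != tuple(checkpoint_shapes[name])
    )
    return missing, unexpected, mismatched


# Made-up names and shapes, for illustration only.
model_shapes = {"speaker_embedding.weight": (123, 128), "encoder.weight": (512, 512)}
checkpoint_shapes = {"speaker_embedding.weight": (200, 128), "encoder.weight": (512, 512)}
print(compare_state_shapes(model_shapes, checkpoint_shapes))
```

With PyTorch, each mapping could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`. Any non-empty list in the result would suggest that config.json does not describe the same architecture as the checkpoint.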

andi-808 commented 3 years ago

waveglow_256channels_universal_v5.pt gives me nothing but noise as well. I could not figure out what was happening for a long time and then I switched to v4 and everything worked.

rafaelvalle commented 3 years ago

@ttt733 are you able to produce spectrograms with the pre-trained model?

ttt733 commented 3 years ago

No. I'm attempting to use the LibriTTS2K model linked in the repo, and I've tried WaveGlow v5 and v4 without success. In my latest attempt I'm also getting warnings from PyTorch:

~/dev/flowtron$ python inference.py -o ./outdir -c config.json -f models/flowtron_libritts2p3k.pt -w models/waveglow_256channels_universal_v4.pt -t "It is well known that deep generative models have a rich latent space!" -i 1088
/home/trevor/anaconda3/envs/blitz/lib/python3.8/site-packages/torch/serialization.py:671: SourceChangeWarning: source code of class 'torch.nn.modules.conv.ConvTranspose1d' has changed. Saved a reverse patch to ConvTranspose1d.patch. Run `patch -p0 < ConvTranspose1d.patch` to revert your changes.
  warnings.warn(msg, SourceChangeWarning)
/home/trevor/anaconda3/envs/blitz/lib/python3.8/site-packages/torch/serialization.py:671: SourceChangeWarning: source code of class 'torch.nn.modules.container.ModuleList' has changed. Tried to save a patch, but couldn't create a writable file ModuleList.patch. Make sure it doesn't exist and your working directory is writable.
  warnings.warn(msg, SourceChangeWarning)
/home/trevor/anaconda3/envs/blitz/lib/python3.8/site-packages/torch/serialization.py:671: SourceChangeWarning: source code of class 'torch.nn.modules.conv.Conv1d' has changed. Saved a reverse patch to Conv1d.patch. Run `patch -p0 < Conv1d.patch` to revert your changes.
  warnings.warn(msg, SourceChangeWarning)

The result is the same as what I posted above. PyTorch version is still 1.10.0.dev20210609.

egorsmkv commented 1 year ago

It works with waveglow_256channels_ljs_v3.pt

To download:

curl -LO 'https://api.ngc.nvidia.com/v2/models/nvidia/waveglow_ljs_256channels/versions/3/files/waveglow_256channels_ljs_v3.pt'