coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Training stopped unexpectedly #1523

Closed: annaklyueva closed this issue 2 years ago

annaklyueva commented 2 years ago

Describe the bug

Good day!

I'm trying to train a YourTTS model. It seems I've done everything correctly; however, training stopped after the first 656 steps. What might be the problem?

To Reproduce

  1. Run the following command: !CUDA_VISIBLE_DEVICES='0' python TTS/bin/train_tts.py --config_path config.json
  2. After that, training starts.

Expected behavior

The training should run for 1000 epochs, but it ran for only 3 epochs and then stopped.

Logs

> Using CUDA:  True
 > Number of GPUs:  1
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:True
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:45
 | > do_sound_norm:True
 | > do_amp_to_db_linear:False
 | > do_amp_to_db_mel:True
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 | > Found 21712 files in /home/vlasova/projects/speechdetox_eng/train_yourtts_ru/Coqui-TTS/datasets/cv_ru/ru
 | > Found 44085 files in /home/vlasova/projects/speechdetox_eng/train_yourtts_ru/Coqui-TTS/datasets/vctk
 > Vocoder Model: vits
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:True
 | > num_mels:64
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:True
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:True
 | > num_mels:64
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:512
 | > power:1.5
 | > preemphasis:0.97
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:False
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:False
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:True
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > stats_path:None
 | > base:10
 | > hop_length:160
 | > win_length:400
 > External Speaker Encoder Loaded !!
 > Language manager is loaded with 2 languages: en, ru-ru

 > Model has 86830276 parameters

 > EPOCH: 0/1000
 --> YourTTS_ru/vits_tts-rus-April-19-2022_10+23PM-fd599f57

 > DataLoader initialization
 | > Use phonemes: False
 | > Number of instances : 65357
 | > Max length sequence: 185
 | > Min length sequence: 3
 | > Avg length sequence: 46.60888657680126
 | > Num. instances discarded by max-min (max=270, min=90) seq limits: 59649
 | > Batch group size: 0.
 > Using Language weighted sampler

 > TRAINING (2022-04-19 22:23:14) 

   --> STEP: 0/178 -- GLOBAL_STEP: 0
     | > loss_spk_encoder: -0.58142  (-0.58142)
     | > loss_gen: 6.02864  (6.02864)
     | > loss_kl: 199.16060  (199.16060)
     | > loss_feat: 0.46400  (0.46400)
     | > loss_mel: 94.66317  (94.66317)
     | > loss_duration: 1.40519  (1.40519)
     | > loss_0: 301.14017  (301.14017)
     | > grad_norm_0: 2519.89355  (2519.89355)
     | > loss_disc: 6.02872  (6.02872)
     | > loss_1: 6.02872  (6.02872)
     | > grad_norm_1: 8.47941  (8.47941)
     | > step_time: 1.76600  (1.76600)
     | > loader_time: 137.33940  (137.33939)

   --> STEP: 1/178 -- GLOBAL_STEP: 1
     | > loss_spk_encoder: -0.53030  (-0.53030)
     | > loss_gen: 4.45582  (4.45582)
     | > loss_kl: 117.64113  (117.64113)
     | > loss_feat: 0.44960  (0.44960)
     | > loss_mel: 74.26104  (74.26104)
     | > loss_duration: 1.37012  (1.37012)
     | > loss_0: 197.64740  (197.64740)
     | > grad_norm_0: 979.22784  (979.22784)
     | > loss_disc: 4.57224  (4.57224)
     | > loss_1: 4.57224  (4.57224)
     | > grad_norm_1: 7.57789  (7.57789)
     | > step_time: 1.62100  (1.62101)
     | > loader_time: 0.05510  (0.05513)

!!! REMOVED THIS PART BECAUSE IT IS TOO LONG. ONLY THE BEGINNING AND THE END ARE LEFT !!!

   --> STEP: 118/178 -- GLOBAL_STEP: 655
     | > loss_spk_encoder: -0.91497  (-0.84993)
     | > loss_gen: 2.46804  (3.00348)
     | > loss_kl: 1.98680  (2.11796)
     | > loss_feat: 4.28438  (5.38346)
     | > loss_mel: 30.07989  (31.81439)
     | > loss_duration: 1.42859  (1.40404)
     | > loss_0: 39.33273  (42.87341)
     | > grad_norm_0: 48.04468  (64.45129)
     | > loss_disc: 2.21401  (1.92375)
     | > loss_1: 2.21401  (1.92375)
     | > grad_norm_1: 18.31875  (18.43089)
     | > step_time: 7.73820  (7.06272)
     | > loader_time: 490.40390  (56.16531)

   --> STEP: 119/178 -- GLOBAL_STEP: 656
     | > loss_spk_encoder: -0.99244  (-0.85113)
     | > loss_gen: 2.58704  (2.99998)
     | > loss_kl: 2.10689  (2.11787)
     | > loss_feat: 4.68556  (5.37760)
     | > loss_mel: 31.52000  (31.81192)
     | > loss_duration: 1.37776  (1.40382)
     | > loss_0: 41.28481  (42.86006)
     | > grad_norm_0: 68.21847  (64.48295)
     | > loss_disc: 2.13860  (1.92555)
     | > loss_1: 2.13860  (1.92555)
     | > grad_norm_1: 10.74519  (18.36630)
     | > step_time: 7.40580  (7.06560)
     | > loader_time: 0.26170  (55.69553)
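
To see where in the schedule the run stopped, the last progress line can be converted into an epoch index. The sketch below is only an illustration: the `locate_step` helper is hypothetical, and the value of 179 steps per epoch is an assumption inferred from the log (steps appear to be numbered 0 to 178, and 656 - 119 = 537 = 3 x 179), not something the trainer prints.

```python
import re

# Matches trainer progress lines such as
# "--> STEP: 119/178 -- GLOBAL_STEP: 656" from the log above.
STEP_RE = re.compile(r"STEP:\s*(\d+)/(\d+)\s*--\s*GLOBAL_STEP:\s*(\d+)")


def locate_step(line, steps_per_epoch):
    """Return (epoch_index, step, global_step) for one progress line, or None."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    step, _last_step, global_step = (int(g) for g in match.groups())
    # Steps completed in earlier epochs, assuming every epoch has the same length.
    completed = global_step - step
    return completed // steps_per_epoch, step, global_step


# Assumption: steps are numbered 0..178, i.e. 179 steps per epoch.
print(locate_step("--> STEP: 119/178 -- GLOBAL_STEP: 656", steps_per_epoch=179))
# prints (3, 119, 656)
```

Under that assumption, the last logged line falls in epoch index 3 at step 119 of 178, which suggests the run stopped in the middle of an epoch rather than at an epoch boundary.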

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 2080 Ti",
            "NVIDIA GeForce RTX 2080 Ti"
        ],
        "available": true,
        "version": "11.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.9.1+cu111",
        "TTS": "0.2.0",
        "numpy": "1.21.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.13",
        "version": "#106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022"
    }
}

Additional context

{
"model": "vits",
"run_name": "vits_tts-rus",
"run_description": "",
"epochs": 1000,
"batch_size": 32,
"eval_batch_size": 32,
"mixed_precision": false,
"scheduler_after_epoch": true,
"run_eval": true,
"test_delay_epochs": -1,
"print_eval": true,
"dashboard_logger": "tensorboard",
"print_step": 1,
"plot_step": 100,
"model_param_stats": false,
"project_name": null,
"log_model_step": 10,
"wandb_entity": null,
"save_step": 20,
"checkpoint": true,
"keep_all_best": false,
"keep_after": 20,
"num_loader_workers": 4,
"num_eval_loader_workers": 4,
"use_noise_augment": false,
"use_language_weighted_sampler": true,
"output_path": "YourTTS_ru",
"distributed_backend": "nccl",
"distributed_url": "tcp://localhost:54321",
"audio": {
    "fft_size": 1024,
    "win_length": 1024,
    "hop_length": 256,
    "frame_shift_ms": null,
    "frame_length_ms": null,
    "stft_pad_mode": "reflect",
    "sample_rate": 16000,
    "resample": true,
    "preemphasis": 0.0,
    "ref_level_db": 20,
    "do_sound_norm": true,
    "log_func": "np.log",
    "do_trim_silence": true,
    "trim_db": 45,
    "power": 1.5,
    "griffin_lim_iters": 60,
    "num_mels": 80,
    "mel_fmin": 0.0,
    "mel_fmax": null,
    "spec_gain": 1,
    "do_amp_to_db_linear": false,
    "do_amp_to_db_mel": true,
    "signal_norm": false,
    "min_level_db": -100,
    "symmetric_norm": true,
    "max_norm": 4.0,
    "clip_norm": true,
    "stats_path": null
},
"use_phonemes": false,
"use_espeak_phonemes": false,
"phoneme_language": "pt-br",
"compute_input_seq_cache": false,
"text_cleaner": "multilingual_cleaners",
"enable_eos_bos_chars": false,
"test_sentences_file": "",
"phoneme_cachepath": null,
"characters": {
    "pad": "",
    "eos": "&",
    "bos": "*",
    "characters":
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzЁЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮёйцукенгшщзхфывапролджэъячсмитьбю«»–—\u00af\u00b7\u00df\u00e0\u00e1\u00e2\u00e3\u00e4\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00ef\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f9\u00fa\u00fb\u00fc\u00ff\u0101\u0105\u0107\u0113\u0119\u011b\u012b\u0131\u0142\u0144\u014d\u0151\u0153\u015b\u016b\u0171\u017a\u017c\u01ce\u01d0\u01d2\u01d4\u0430\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449\u044a\u044b\u044c\u044d\u044e\u044f\u0451\u0454\u0456\u0457\u0491\u2013!'(),-.:;? ", "punctuations": "!'(),-.:;? «»–—", "phonemes": "iy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259\u0275\u0264o\u025b\u0153\u025c\u025e\u028c\u0254\u00e6\u0250a\u0276\u0251\u0252\u1d7b\u0298\u0253\u01c0\u0257\u01c3\u0284\u01c2\u0260\u01c1\u029bpbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029f\u02c8\u02cc\u02d0\u02d1\u028dw\u0265\u029c\u02a2\u02a1\u0255\u0291\u027a\u0267\u025a\u02de\u026b'\u0303' ", "unique": true }, "batch_group_size": 0, "loss_masking": null, "min_seq_len": 90, "max_seq_len": 270, "compute_f0": false, "compute_linear_spec": true, "add_blank": true, "datasets": [

    {
        "name": "common_voice",
        "path": "datasets/cv_ru/ru/",
        "meta_file_train": "train.tsv",
        "ununsed_speakers": null,
        "language": "ru-ru",
        "meta_file_val": "dev.tsv",
        "meta_file_attn_mask": ""
    },
    {
        "name": "vctk",
        "path": "datasets/vctk/",
        "meta_file_train": null,
        "ununsed_speakers": null,
        "language": "en",
        "meta_file_val": null,
        "meta_file_attn_mask": ""
    }

],
"optimizer": "AdamW",
"optimizer_params": {
    "betas": [
        0.8,
        0.99
    ],
    "eps": 1e-09,
    "weight_decay": 0.01
},
"lr_scheduler": "",
"lr_scheduler_params": null,
"test_sentences": [],
"use_speaker_embedding": false,
"use_d_vector_file": true,
"d_vector_dim": 512,
"model_args": {
    "num_chars": 165,
    "out_channels": 513,
    "spec_segment_size": 62,
    "hidden_channels": 192,
    "hidden_channels_ffn_text_encoder": 768,
    "num_heads_text_encoder": 2,
    "num_layers_text_encoder": 10,
    "kernel_size_text_encoder": 3,
    "dropout_p_text_encoder": 0.1,
    "dropout_p_duration_predictor": 0.5,
    "kernel_size_posterior_encoder": 5,
    "dilation_rate_posterior_encoder": 1,
    "num_layers_posterior_encoder": 16,
    "kernel_size_flow": 5,
    "dilation_rate_flow": 1,
    "num_layers_flow": 4,
    "resblock_type_decoder": "2",
    "resblock_kernel_sizes_decoder": [
        3,
        7,
        11
    ],
    "resblock_dilation_sizes_decoder": [
        [
            1,
            3,
            5
        ],
        [
            1,
            3,
            5
        ],
        [
            1,
            3,
            5
        ]
    ],
    "upsample_rates_decoder": [
        8,
        8,
        2,
        2
    ],
    "upsample_initial_channel_decoder": 512,
    "upsample_kernel_sizes_decoder": [
        16,
        16,
        4,
        4
    ],
    "use_sdp": true,
    "noise_scale": 1.0,
    "inference_noise_scale": 0.3,
    "length_scale": 1.5,
    "noise_scale_dp": 0.6,
    "inference_noise_scale_dp": 0.3,
    "max_inference_len": null,
    "init_discriminator": true,
    "use_spectral_norm_disriminator": false,
    "use_speaker_embedding": false,
    "num_speakers": 1244,
    "speakers_file": null,
    "d_vector_file": "speakers.json",
    "speaker_embedding_channels": 512,
    "use_d_vector_file": true,
    "d_vector_dim": 512,
    "detach_dp_input": true,
    "use_language_embedding": true,
    "embedded_language_dim": 4,
    "num_languages": 2,
    "use_speaker_encoder_as_loss": true,
    "speaker_encoder_config_path": "config_se.json",
    "speaker_encoder_model_path": "model_se.pth.tar",
    "fine_tuning_mode": 0,
    "freeze_encoder": false,
    "freeze_DP": false,
    "freeze_PE": false,
    "freeze_flow_decoder": false,
    "freeze_waveform_decoder": false
},
"grad_clip": [
    5.0,
    5.0
],
"lr_gen": 0.0002,
"lr_disc": 0.0002,
"lr_scheduler_gen": "ExponentialLR",
"lr_scheduler_gen_params": {
    "gamma": 0.999875,
    "last_epoch": -1
},
"lr_scheduler_disc": "ExponentialLR",
"lr_scheduler_disc_params": {
    "gamma": 0.999875,
    "last_epoch": -1
},
"kl_loss_alpha": 1.0,
"disc_loss_alpha": 1.0,
"gen_loss_alpha": 1.0,
"feat_loss_alpha": 1.0,
"mel_loss_alpha": 45.0,
"dur_loss_alpha": 1.0,
"speaker_encoder_loss_alpha": 9.0,
"return_wav": true,
"r": 1

}
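
The config sets `min_seq_len` to 90 and `max_seq_len` to 270, and the DataLoader log reports 59649 of 65357 instances discarded by those limits, with an average sequence length of about 46.6. The following is a minimal sketch of that arithmetic using only the numbers from the log and `batch_size` from the config; the `filter_summary` helper is hypothetical, and nothing here establishes that the filtering is related to the training stop.

```python
def filter_summary(total, discarded, batch_size):
    """Summarise how many instances survive the sequence-length filter."""
    kept = total - discarded
    return {
        "kept": kept,
        "discarded_fraction": discarded / total,
        "full_batches": kept // batch_size,
        "leftover": kept % batch_size,
    }


# Numbers copied from the DataLoader log and the config above.
summary = filter_summary(total=65357, discarded=59649, batch_size=32)
print(summary["kept"], summary["full_batches"], summary["leftover"])
# prints 5708 178 12
print(f"{summary['discarded_fraction']:.1%} of instances discarded")
```

Only 5708 instances remain, which gives 178 full batches plus a partial one and is consistent with the `STEP: n/178` counter in the log. Since the minimum length of 90 is roughly twice the average sequence length, it may be worth checking whether these limits are intended.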

erogol commented 2 years ago

Please try the latest version.
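
Before upgrading, it can help to confirm which version is actually installed in the environment that runs the training. A minimal sketch using the standard library follows; it assumes the package is installed under the distribution name `TTS`, as shown in the Environment section, and the `installed_version` helper is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "TTS" is the distribution name reported in the Environment section.
print(installed_version("TTS"))
```

The Environment section above reports 0.2.0. When training from a fork or a specific branch, the reported version reflects that checkout rather than the latest release.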

annaklyueva commented 2 years ago

@erogol I was following this pipeline (https://github.com/Edresson/YourTTS/issues/8), which says that for training and fine-tuning we have to use this branch: https://github.com/Edresson/Coqui-TTS/tree/multilingual-torchaudio-SE/

If that is not correct, could you please clarify what steps I should follow?

kin0303 commented 2 years ago

I had the same issue, but while training Tacotron2 models.

erogol commented 2 years ago

I don't see any errors in the logs. Without an error message or traceback, it is hard for us to help. Can you try to gather more information about how the training actually ends?
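
One way to gather that information is to run the training script from a small wrapper that saves all output to a file and reports how the process exited. This is a sketch only: `describe_exit`, `run_and_log`, and the `train.log` file name are hypothetical, and the command in the comment is the one from the report.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable summary."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative code means the process was killed by a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with error code {returncode}"


def run_and_log(cmd, log_path):
    """Run cmd, write stdout and stderr to log_path, and describe the exit."""
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return describe_exit(proc.returncode)


# Example usage with the command from the report (not executed here):
# cmd = [sys.executable, "TTS/bin/train_tts.py", "--config_path", "config.json"]
# print(run_and_log(cmd, "train.log"))
```

If the result is "killed by SIGKILL", one common cause on Linux is the kernel's out-of-memory killer, which leaves no Python traceback; the system log may then show more detail. Any other result, together with the tail of the log file, would give a concrete error to report.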

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.