coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Loss getting stuck while training a Tacotron2 model on custom dataset in Hindi #1369

Closed ajaiswal1008 closed 2 years ago

ajaiswal1008 commented 2 years ago

Hello,

I am training a Tacotron2 model with my custom dataset in Hindi. Dataset details: 25 hours of data, 22 kHz, 16-bit, single female speaker.

Issue: Training is very slow, and the loss seems to be stuck. The eval audio sounds OK, but audio generated from the test sentences is not understandable. Question: should I train more (300 epochs are already done), or is there something else I can try to get better and faster results?
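
One way to put a number on "the loss seems to be stuck" is a moving-window comparison of the logged loss curve. A minimal sketch (the loss values below are synthetic, not numbers from this run):

```python
# Crude plateau check on a logged loss curve: compare the mean of the
# most recent window against the mean of the window before it.
def is_plateaued(losses, window=50, tol=0.01):
    """True if relative improvement between the previous window and the
    latest window is below `tol`."""
    if len(losses) < 2 * window:
        return False
    prev = sum(losses[-2 * window:-window]) / window
    last = sum(losses[-window:]) / window
    return (prev - last) / max(abs(prev), 1e-12) < tol

# Synthetic examples: a still-improving curve and a stuck one.
decreasing = [1.0 / (1 + 0.01 * i) for i in range(200)]
flat = [0.5 + 0.001 * ((-1) ** i) for i in range(200)]
```

If the curve passes this kind of check for tens of thousands of steps, more epochs alone are unlikely to help.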

[Four screenshots from 2022-03-10, 7:56 to 7:57 PM]

I am using 4 Tesla V100 GPUs for training with a batch size of 32. I am also attaching my config.json for detailed information.

Any help on how I can get better results would be really appreciated.

Below is my config:

{ "model": "tacotron2", "run_name": "iiith_female_graves_tp", "run_description": "", "epochs": 1000, "batch_size": 32, "eval_batch_size": 8, "mixed_precision": false, "scheduler_after_epoch": false, "run_eval": true, "test_delay_epochs": -1, "print_eval": true, "dashboard_logger": "tensorboard", "print_step": 25, "plot_step": 100, "model_param_stats": false, "project_name": null, "log_model_step": null, "wandb_entity": null, "save_step": 10000, "checkpoint": true, "keep_all_best": false, "keep_after": 10000, "num_loader_workers": 4, "num_eval_loader_workers": 4, "use_noise_augment": false, "use_language_weighted_sampler": false, "output_path": "/home/ubuntu/coqui/tts-coqui", "distributed_backend": "nccl", "distributed_url": "tcp://localhost:54321", "audio": { "fft_size": 1024, "win_length": 1024, "hop_length": 256, "frame_shift_ms": null, "frame_length_ms": null, "stft_pad_mode": "reflect", "sample_rate": 22000, "resample": false, "preemphasis": 0.99, "ref_level_db": 0, "do_sound_norm": false, "log_func": "np.log", "do_trim_silence": true, "trim_db": 60.0, "do_rms_norm": false, "db_level": null, "power": 1.5, "griffin_lim_iters": 60, "num_mels": 80, "mel_fmin": 0.0, "mel_fmax": 8000, "spec_gain": 1.0, "do_amp_to_db_linear": true, "do_amp_to_db_mel": true, "signal_norm": false, "min_level_db": -100, "symmetric_norm": true, "max_norm": 4.0, "clip_norm": true, "stats_path": null }, "use_phonemes": true, "use_espeak_phonemes": true, "phoneme_language": "hi", "compute_input_seq_cache": false, "text_cleaner": "hindi_cleaners", "enable_eos_bos_chars": false, "test_sentences_file": "", "phoneme_cache_path": "/home/ubuntu/coqui/tts-coqui/phonemecache", "characters": { "pad": "", "eos": "~", "bos": "^", "characters": 
"\u0905\u0906\u0907\u0908\u0909\u090a\u090b\u090f\u0910\u0911\u0913\u0914\u0915\u0916\u0917\u0918\u0919\u091a\u091b\u091c\u091d\u091e\u091f\u0920\u0921\u0922\u0923\u0924\u0925\u0926\u0927\u0928\u092a\u092b\u092c\u092d\u092e\u092f\u0930\u0932\u0935\u0936\u0937\u0938\u0939\u0939\u093c\u093e\u093f\u0940\u0941\u0942\u0943\u0947\u0948\u0949\u094b\u094c\u094d!'(),-.:;? ", "punctuations": "!'\",.:?\u0964 ", "phonemes": "iy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259\u0275\u0264o\u025b\u0153\u025c\u025e\u028c\u0254\u00e6\u0250a\u0276\u0251\u0252\u1d7b\u0298\u0253\u01c0\u0257\u01c3\u0284\u01c2\u0260\u01c1\u029bpbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029f\u02c8\u02cc\u02d0\u02d1\u028dw\u0265\u029c\u02a2\u02a1\u0255\u0291\u027a\u0267\u025a\u02de\u026b", "unique": true }, "batch_group_size": 0, "loss_masking": true, "sort_by_audio_len": false, "min_seq_len": 1, "max_seq_len": 150, "compute_f0": false, "compute_linear_spec": false, "add_blank": false, "datasets": [ { "name": "ljspeech", "path": "/home/ubuntu/coqui/dataset/iiith_female_downsampled/", "meta_file_train": "metadata.csv", "ignored_speakers": null, "language": "", "meta_file_val": "", "meta_file_attn_mask": "" } ], "optimizer": "RAdam", "optimizer_params": { "betas": [ 0.9, 0.998 ], "weight_decay": 1e-06 }, "lr_scheduler": "NoamLR", "lr_scheduler_params": { "warmup_steps": 4000 }, "test_sentences": [ "\u0907\u0938 \u0906\u0927\u093e\u0930 \u092a\u0930 \u0935\u094b\u091f \u092e\u093e\u0902\u0917\u0928\u093e \u0938\u0902\u0935\u093f\u0927\u093e\u0928 \u0915\u0940 \u092d\u093e\u0935\u0928\u093e \u0915\u0947 \u0916\u093f\u0932\u093e\u092b \u0939\u0948\u0964", "\u0935\u0947 \u0915\u0939\u0924\u0947 \u0939\u0948\u0902 '\u092e\u0947\u0930\u0947 \u092a\u093f\u0924\u093e \u0915\u094b 
\u0935\u0947\u0902\u091f\u0940\u0932\u0947\u091f\u0930 \u0938\u0947 \u0939\u091f\u093e \u0926\u093f\u092f\u093e \u0917\u092f\u093e \u0925\u093e\u0964", "\u092a\u094d\u0930\u0926\u0942\u0937\u093f\u0924 \u0928\u0926\u093f\u092f\u094b\u0902 \u092e\u0947\u0902 \u0938\u0947 \u090f\u0915 \u0939\u0948\u0926\u0930\u093e\u092c\u093e\u0926 \u0915\u0940 '\u092e\u0942\u0938\u0940 \u0928\u0926\u0940\u0964", "\u0905\u0916\u093f\u0932\u0947\u0936 \u091c\u094b \u0915\u0930 \u0930\u0939\u093e \u0939\u0948 \u0909\u0938\u0947 \u0915\u0930\u0928\u0947 \u0926\u094b\u0964", "\u0917\u094c\u0930\u0924\u0932\u092c \u0939\u0948 \u0915\u093f \u092e\u0902\u0917\u0932\u0935\u093e\u0930 \u0915\u094b \u0905\u0928\u094d\u0928\u093e \u0915\u0947 \u0905\u0928\u0936\u0928 \u0915\u093e \u0924\u0940\u0938\u0930\u093e \u0926\u093f\u0928 \u0939\u0948\u0964" ], "use_gst": false, "gst": null, "gst_style_input": null, "num_speakers": 1, "num_chars": 127, "r": 2, "gradual_training": null, "memory_size": -1, "prenet_type": "original", "prenet_dropout": true, "prenet_dropout_at_inference": false, "stopnet": true, "separate_stopnet": true, "stopnet_pos_weight": 10.0, "max_decoder_steps": 500, "encoder_in_features": 512, "decoder_in_features": 512, "decoder_output_dim": 80, "out_channels": 80, "attention_type": "graves", "attention_heads": 4, "attention_norm": "sigmoid", "attention_win": false, "windowing": false, "use_forward_attn": false, "forward_attn_mask": false, "transition_agent": false, "location_attn": true, "bidirectional_decoder": false, "double_decoder_consistency": true, "ddc_r": 6, "use_speaker_embedding": false, "speaker_embedding_dim": 512, "use_d_vector_file": false, "d_vector_file": false, "d_vector_dim": null, "lr": 0.0001, "grad_clip": 5.0, "seq_len_norm": false, "decoder_loss_alpha": 0.25, "postnet_loss_alpha": 0.25, "postnet_diff_spec_alpha": 0.25, "decoder_diff_spec_alpha": 0.25, "decoder_ssim_alpha": 0.25, "postnet_ssim_alpha": 0.25, "ga_alpha": 5.0 }

erogol commented 2 years ago

Change the attention type to something else. For faster training, disable double decoder consistency.
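
Both knobs correspond to fields that appear verbatim in the config posted above (`attention_type`, currently `"graves"`, and `double_decoder_consistency`). A sketch of patching them in place; the file path is a placeholder, and `"dynamic_convolution"` is assumed to be the value Coqui TTS uses for dynamic convolution attention:

```python
import json
import os
import tempfile

def patch_config(path):
    """Apply the two suggested changes to an existing config.json."""
    with open(path) as f:
        cfg = json.load(f)
    cfg["attention_type"] = "dynamic_convolution"  # was "graves"
    cfg["double_decoder_consistency"] = False      # disable DDC for speed
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2, ensure_ascii=False)
    return cfg

# Demo on a minimal stand-in config (your real file has many more keys):
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump({"attention_type": "graves",
               "double_decoder_consistency": True}, f)
cfg = patch_config(path)
```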

ajaiswal1008 commented 2 years ago

Hi @erogol , Thanks for your response.

I changed the attention type to dynamic convolution and disabled double decoder consistency. It improved the alignments significantly. Below is the image:

[Screenshot from 2022-03-25 at 1:06 PM]

I also have a HiFi-GAN-based vocoder trained on the same data (~100K steps). Below are audio samples from the Tacotron model using (a) Griffin-Lim and (b) the HiFi-GAN vocoder:

https://soundcloud.com/anchal-jaiswal-61632844/sets/tacotron?utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing

When listening to the audio you will notice that the vocoder output has a lot of background noise. It would be great if you could help me answer the following questions:

  1. Is the background noise caused by the Tacotron model or by the vocoder?
  2. How can I remove the noise and make it sound more studio-grade?
  3. The loss is not decreasing much after 60K steps of Tacotron training. Should I continue training this model further, or is this the best I can get from my current dataset?
  4. Is this Tacotron model good enough to compute attention masks for FastSpeech training?
  5. There are also some mispronunciations with the current Tacotron model. How can I improve it further?

Any help would be very valuable.

Thanks

Arjunprasaath commented 2 years ago

@ajaiswal1008 Dude, I have some queries related to this. Can we talk?

WeberJulian commented 2 years ago

> How can I remove the noise and make it sound more studio-grade?

Hey, could you share some samples from the dataset, please?

ajaiswal1008 commented 2 years ago

Hey @WeberJulian - Below is the link to audio files from my dataset: https://soundcloud.com/anchal-jaiswal-61632844/sets/dataset?utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing

ajaiswal1008 commented 2 years ago

@Arjunprasaath - Sure, let me know your queries. I'd be happy to help.

WeberJulian commented 2 years ago

OK, so after listening to some audio from your dataset, I'm pretty sure the background noise comes from your dataset. You need cleaner data if you want cleaner output.

We're working on denoiser and bandwidth-extension models; you can follow their progress here: https://github.com/coqui-ai/TTS/pull/1451
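
Until something like that PR lands, a very crude classical stand-in is a frame-level noise gate: attenuate frames whose RMS sits near the clip's noise floor. This is nothing like a learned denoiser, and the threshold and attenuation values below are arbitrary:

```python
import numpy as np

def noise_gate(wav, frame=1024, threshold_scale=2.0, atten=0.1):
    """Attenuate frames whose RMS is close to the estimated noise floor."""
    out = wav.astype(float).copy()
    n = len(wav) // frame
    rms = np.array([np.sqrt(np.mean(out[i*frame:(i+1)*frame] ** 2))
                    for i in range(n)])
    threshold = threshold_scale * np.percentile(rms, 10)
    for i in range(n):
        if rms[i] < threshold:
            out[i*frame:(i+1)*frame] *= atten
    return out

# Demo: a hiss-padded tone; the gate should quiet the hiss-only frames
# while leaving the tone frames untouched.
sr = 22000
rng = np.random.default_rng(1)
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
noisy = np.concatenate([0.01 * rng.standard_normal(sr // 2), tone,
                        0.01 * rng.standard_normal(sr // 2)])
gated = noise_gate(noisy)
```

Note this only silences pauses; it cannot remove hiss that overlaps speech, which is why a learned model is the better long-term answer.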

WeberJulian commented 2 years ago

Also, fine-tuning HiFi-GAN on your TTS model's output might help.

ajaiswal1008 commented 2 years ago

Agreed, the data has a bit of noise, which might be causing the noise in the TTS audio. I actually trained a HiFi-GAN from scratch on the same dataset (~180K steps) and noticed that its eval audio has significantly less noise than the samples generated with Tacotron and HiFi-GAN combined, which is why I was hopeful of getting better-quality TTS audio. Below is the link to the eval audio from HiFi-GAN:

https://soundcloud.com/anchal-jaiswal-61632844/hifi-gan-sample?utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing

erogol commented 2 years ago

I'm closing this issue as the core problem looks to be solved. Feel free to reopen or continue the discussion.

anuragshas commented 2 years ago

@ajaiswal1008 Are you using this dataset? If yes, note that it has MP3 files renamed as WAV. MP3 will have compression artifacts, which might degrade the output quality of the TTS.
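
A quick way to confirm the renamed-MP3 claim for any given file is to inspect its first bytes rather than trust the extension. A stdlib-only sketch (the demo writes two fabricated headers, not real audio):

```python
import os
import tempfile

# Genuine WAV files begin with b"RIFF" ... b"WAVE"; MP3 streams begin
# with an "ID3" tag or a 0xFFEx frame-sync byte pair.
def audio_container(path):
    with open(path, "rb") as f:
        head = f.read(12)
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "wav"
    if head[:3] == b"ID3" or (len(head) >= 2 and head[0] == 0xFF
                              and (head[1] & 0xE0) == 0xE0):
        return "mp3"
    return "unknown"

# Demo with two fabricated headers:
d = tempfile.mkdtemp()
wav_path = os.path.join(d, "real.wav")
mp3_path = os.path.join(d, "fake.wav")  # mp3 bytes behind a .wav extension
with open(wav_path, "wb") as f:
    f.write(b"RIFF\x24\x00\x00\x00WAVEfmt ")
with open(mp3_path, "wb") as f:
    f.write(b"ID3\x04\x00\x00\x00\x00\x00\x00")
```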

ajaiswal1008 commented 2 years ago

@anuragshas You are right, the audio is compressed and has a bit of noise. Do you know of any other Hindi TTS dataset comparable to this one in size?

anuragshas commented 2 years ago

@ajaiswal1008 There is Indic TTS, but it is only about 5 hours each of male and female voice; it gives decent results though. You can try out the Hindi TTS demo made by Harveen. They used GlowTTS + HiFiGAN.

SaadBazaz commented 2 years ago

Were you able to create a decent model? Have you uploaded it anywhere?

Arjunprasaath commented 2 years ago

Hey, I'm using the NVIDIA NeMo TTS Tacotron2 model. Does anyone know how to plot the attention graph for a trained model?
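
A generic recipe (not NeMo-specific): grab the attention-weight matrix from the model's forward pass and `imshow` it. The matrix below is synthetic and merely stands in for real model output:

```python
import os
import tempfile
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt

# `alignment` is a (decoder_steps, encoder_steps) attention-weight
# matrix; here a roughly diagonal synthetic matrix is used as a demo.
steps_out, steps_in = 120, 40
alignment = np.exp(-0.05 * (np.arange(steps_out)[:, None] / 3
                            - np.arange(steps_in)[None, :]) ** 2)
alignment /= alignment.sum(axis=1, keepdims=True)  # rows sum to 1

fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(alignment.T, aspect="auto", origin="lower",
               interpolation="none")
ax.set_xlabel("Decoder timestep")
ax.set_ylabel("Encoder timestep")
fig.colorbar(im, ax=ax)
out_path = os.path.join(tempfile.mkdtemp(), "alignment.png")
fig.savefig(out_path)
```

A healthy model shows a clean, mostly monotonic diagonal; a blurry or broken diagonal usually tracks the unintelligible-audio symptom discussed earlier in this thread.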

omkarade commented 2 years ago

@ajaiswal1008 Hello sir, can you please help me?