ajaiswal1008 closed this issue 2 years ago.
Change the attention type to something else. For faster training, disable double decoder consistency.
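In the Coqui TTS config posted later in this thread, those two suggestions map onto two existing keys; the fragment below shows the suggested values (using `dynamic_convolution`, the attention type the poster ends up trying):

```json
{
  "attention_type": "dynamic_convolution",
  "double_decoder_consistency": false
}
```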
Hi @erogol, thanks for your response.
I changed the attention type to dynamic convolution and disabled double decoder consistency. It improved the alignments significantly. Below is the image.
I also have a HiFi-GAN-based vocoder trained on the same data (~100K steps). Below are audio samples of the Tacotron model using a) Griffin-Lim and b) the HiFi-GAN vocoder.
When listening to the audios you will notice that the vocoder audio has a lot of background noise. It would be great if you could help me answer the following questions:
Any help would be very valuable.
Thanks
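(For context: Griffin-Lim reconstructs a waveform from a magnitude spectrogram by iteratively re-estimating phase; the config at the end of this thread uses an FFT size of 1024, a hop of 256, and 60 iterations. A minimal numpy/scipy sketch of the algorithm — not Coqui's implementation:)

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=60, nperseg=1024, noverlap=768):
    """Estimate phase for a magnitude spectrogram by alternating
    inverse/forward STFTs, then return the reconstructed waveform.
    nperseg=1024 / noverlap=768 gives a hop of 256, matching the config."""
    rng = np.random.default_rng(0)
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random init phase
    for _ in range(n_iter):
        _, signal = istft(magnitude * angles, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(signal, nperseg=nperseg, noverlap=noverlap)
        # Guard against off-by-one frame counts from the STFT round trip.
        spec = spec[: magnitude.shape[0], : magnitude.shape[1]]
        spec = np.pad(spec, [(0, magnitude.shape[0] - spec.shape[0]),
                             (0, magnitude.shape[1] - spec.shape[1])])
        angles = np.exp(1j * np.angle(spec))  # keep phase, discard magnitude
    _, signal = istft(magnitude * angles, nperseg=nperseg, noverlap=noverlap)
    return signal
```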
@ajaiswal1008 Dude, I have some queries related to this. Can we talk?
How can I remove the noise and make it sound more studio-grade?
Hey, could you share samples from the dataset please ?
Hey @WeberJulian - Below is a link to audio files from my dataset: https://soundcloud.com/anchal-jaiswal-61632844/sets/dataset?utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing
@Arjunprasaath - Sure, let me know your queries. Would be happy to help
Ok, so after listening to some audios from your dataset, I'm pretty sure the background noise comes from your dataset. You need cleaner data if you want cleaner output.
We're working on denoiser and bandwidth-extension models; you can follow their progress here: https://github.com/coqui-ai/TTS/pull/1451
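(Until something like that lands, a crude spectral-gating pass can knock down *stationary* background noise: estimate a noise floor from a noise-only clip and attenuate spectrogram bins that stay near it. A hypothetical numpy/scipy sketch, unrelated to the models in the linked PR:)

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_gate(audio, noise_clip, sr=22050, reduction_db=20.0):
    """Attenuate time-frequency bins whose magnitude stays near the noise
    floor estimated from a noise-only clip. Works for stationary noise only."""
    _, _, noise_spec = stft(noise_clip, fs=sr, nperseg=1024)
    noise_floor = np.abs(noise_spec).mean(axis=1, keepdims=True)  # per-frequency floor
    _, _, spec = stft(audio, fs=sr, nperseg=1024)
    attenuation = 10.0 ** (-reduction_db / 20.0)
    # Keep bins well above the floor, attenuate the rest by reduction_db.
    gain = np.where(np.abs(spec) > 2.0 * noise_floor, 1.0, attenuation)
    _, cleaned = istft(spec * gain, fs=sr, nperseg=1024)
    return cleaned[: len(audio)]
```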
Also, fine-tuning HifiGAN on your TTS model's output might help.
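(For what it's worth, the idea is to train the vocoder on mel spectrograms *predicted* by the TTS model via teacher forcing, paired with the ground-truth audio, so the vocoder learns to compensate for the TTS model's spectral artifacts. A schematic sketch; `tts_teacher_forced` is a hypothetical stand-in for a teacher-forced forward pass of your trained Tacotron:)

```python
def make_vocoder_finetune_pairs(tts_teacher_forced, dataset):
    """Build (predicted_mel, ground_truth_audio) pairs for vocoder
    fine-tuning. `tts_teacher_forced` is a placeholder callable that runs
    the trained TTS model with ground-truth frames fed to the decoder."""
    pairs = []
    for text, audio in dataset:
        predicted_mel = tts_teacher_forced(text, audio)  # mel with TTS artifacts
        pairs.append((predicted_mel, audio))             # target stays clean audio
    return pairs
```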
Agreed, the data has a bit of noise, which might be causing the noise in the TTS audio. I actually trained a HiFi-GAN from scratch on the same dataset (~180K steps) and noticed that its eval audio has significantly less noise than the samples generated by Tacotron and HiFi-GAN combined, which is why I was hopeful of getting better-quality TTS audio. Below is the link to the eval audio from HiFi-GAN.
I'm closing this issue as the core problem looks to be solved. Feel free to reopen or continue the discussion.
@ajaiswal1008 Are you using this dataset? If so, it has MP3 files renamed as WAV. MP3 has compression artifacts, which might ruin the output quality of the TTS.
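(You can verify this without listening: a genuine WAV file starts with a RIFF/WAVE header, while an MP3 renamed to `.wav` typically starts with an `ID3` tag or an MPEG frame-sync byte. A small stdlib check; the function name is mine:)

```python
from pathlib import Path

def is_real_wav(path):
    """Return True only if the file carries a RIFF/WAVE header.
    MP3s renamed to .wav start with b'ID3' or an MPEG frame sync instead."""
    header = Path(path).read_bytes()[:12]
    return (len(header) >= 12
            and header[:4] == b"RIFF"
            and header[8:12] == b"WAVE")
```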
@anuragshas You are right, the audio is compressed and has a bit of noise. Do you know of any other TTS dataset in Hindi which is comparable to this one in size?
Were you able to create a decent model? Have you uploaded it anywhere?
Hey, I'm using the NVIDIA NeMo TTS Tacotron2 model. Do you guys know how to plot the attention graph for a trained model?
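(I can't speak to NeMo specifics, but once you have the alignment tensor from a forward pass — a 2-D array of attention weights, decoder steps by encoder steps — plotting it is just an `imshow`. A generic matplotlib sketch:)

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # write to file; no display needed
import matplotlib.pyplot as plt

def plot_alignment(alignment, out_path="alignment.png"):
    """alignment: 2-D array of shape [decoder_steps, encoder_steps].
    A clean diagonal means the model attends to the text monotonically."""
    fig, ax = plt.subplots(figsize=(6, 4))
    im = ax.imshow(alignment.T, aspect="auto", origin="lower",
                   interpolation="none")
    fig.colorbar(im, ax=ax, label="attention weight")
    ax.set_xlabel("decoder timestep")
    ax.set_ylabel("encoder timestep")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    return out_path
```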
@ajaiswal1008 Hello sir, can you please help me?
Hello,
I am training a Tacotron2 model on my custom Hindi dataset. Dataset details: 25 hours of data, 22 kHz, 16-bit, single female speaker.
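(Mismatched audio parameters are a common cause of stalled training, so it may be worth verifying every clip actually matches the config — the config below sets `sample_rate` to 22000, and 16-bit mono corresponds to a 2-byte sample width and 1 channel. A stdlib sketch; the function name is mine:)

```python
import wave

def matches_config(path, rate=22000, sampwidth=2, channels=1):
    """Check a WAV clip against the expected sample rate,
    sample width (2 bytes = 16-bit), and channel count."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getsampwidth() == sampwidth
                and w.getnchannels() == channels)
```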
Issue: The training is very slow, and the loss seems to be stuck. Eval audio sounds ok, but audio generated from test sentences is not understandable. Question: Should I train more (300 epochs are already done), or is there something else I can try to get better and faster results?
I am using 4 Tesla V100 GPUs for training with a batch size of 32. I am also attaching my config.json for detailed information.
Any help on how I can get better results would be really appreciated.
Below is my config:
```json
{
  "model": "tacotron2", "run_name": "iiith_female_graves_tp", "run_description": "",
  "epochs": 1000, "batch_size": 32, "eval_batch_size": 8, "mixed_precision": false,
  "scheduler_after_epoch": false, "run_eval": true, "test_delay_epochs": -1,
  "print_eval": true, "dashboard_logger": "tensorboard", "print_step": 25,
  "plot_step": 100, "model_param_stats": false, "project_name": null,
  "log_model_step": null, "wandb_entity": null, "save_step": 10000,
  "checkpoint": true, "keep_all_best": false, "keep_after": 10000,
  "num_loader_workers": 4, "num_eval_loader_workers": 4, "use_noise_augment": false,
  "use_language_weighted_sampler": false, "output_path": "/home/ubuntu/coqui/tts-coqui",
  "distributed_backend": "nccl", "distributed_url": "tcp://localhost:54321",
  "audio": {
    "fft_size": 1024, "win_length": 1024, "hop_length": 256,
    "frame_shift_ms": null, "frame_length_ms": null, "stft_pad_mode": "reflect",
    "sample_rate": 22000, "resample": false, "preemphasis": 0.99,
    "ref_level_db": 0, "do_sound_norm": false, "log_func": "np.log",
    "do_trim_silence": true, "trim_db": 60.0, "do_rms_norm": false,
    "db_level": null, "power": 1.5, "griffin_lim_iters": 60,
    "num_mels": 80, "mel_fmin": 0.0, "mel_fmax": 8000, "spec_gain": 1.0,
    "do_amp_to_db_linear": true, "do_amp_to_db_mel": true, "signal_norm": false,
    "min_level_db": -100, "symmetric_norm": true, "max_norm": 4.0,
    "clip_norm": true, "stats_path": null
  },
  "use_phonemes": true, "use_espeak_phonemes": true, "phoneme_language": "hi",
  "compute_input_seq_cache": false, "text_cleaner": "hindi_cleaners",
  "enable_eos_bos_chars": false, "test_sentences_file": "",
  "phoneme_cache_path": "/home/ubuntu/coqui/tts-coqui/phonemecache",
  "characters": {
    "pad": "", "eos": "~", "bos": "^",
    "characters": "\u0905\u0906\u0907\u0908\u0909\u090a\u090b\u090f\u0910\u0911\u0913\u0914\u0915\u0916\u0917\u0918\u0919\u091a\u091b\u091c\u091d\u091e\u091f\u0920\u0921\u0922\u0923\u0924\u0925\u0926\u0927\u0928\u092a\u092b\u092c\u092d\u092e\u092f\u0930\u0932\u0935\u0936\u0937\u0938\u0939\u0939\u093c\u093e\u093f\u0940\u0941\u0942\u0943\u0947\u0948\u0949\u094b\u094c\u094d!'(),-.:;? ",
    "punctuations": "!'\",.:?\u0964 ",
    "phonemes": "iy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259\u0275\u0264o\u025b\u0153\u025c\u025e\u028c\u0254\u00e6\u0250a\u0276\u0251\u0252\u1d7b\u0298\u0253\u01c0\u0257\u01c3\u0284\u01c2\u0260\u01c1\u029bpbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029f\u02c8\u02cc\u02d0\u02d1\u028dw\u0265\u029c\u02a2\u02a1\u0255\u0291\u027a\u0267\u025a\u02de\u026b",
    "unique": true
  },
  "batch_group_size": 0, "loss_masking": true, "sort_by_audio_len": false,
  "min_seq_len": 1, "max_seq_len": 150, "compute_f0": false,
  "compute_linear_spec": false, "add_blank": false,
  "datasets": [
    {
      "name": "ljspeech", "path": "/home/ubuntu/coqui/dataset/iiith_female_downsampled/",
      "meta_file_train": "metadata.csv", "ignored_speakers": null,
      "language": "", "meta_file_val": "", "meta_file_attn_mask": ""
    }
  ],
  "optimizer": "RAdam", "optimizer_params": { "betas": [ 0.9, 0.998 ], "weight_decay": 1e-06 },
  "lr_scheduler": "NoamLR", "lr_scheduler_params": { "warmup_steps": 4000 },
  "test_sentences": [
    "\u0907\u0938 \u0906\u0927\u093e\u0930 \u092a\u0930 \u0935\u094b\u091f \u092e\u093e\u0902\u0917\u0928\u093e \u0938\u0902\u0935\u093f\u0927\u093e\u0928 \u0915\u0940 \u092d\u093e\u0935\u0928\u093e \u0915\u0947 \u0916\u093f\u0932\u093e\u092b \u0939\u0948\u0964",
    "\u0935\u0947 \u0915\u0939\u0924\u0947 \u0939\u0948\u0902 '\u092e\u0947\u0930\u0947 \u092a\u093f\u0924\u093e \u0915\u094b \u0935\u0947\u0902\u091f\u0940\u0932\u0947\u091f\u0930 \u0938\u0947 \u0939\u091f\u093e \u0926\u093f\u092f\u093e \u0917\u092f\u093e \u0925\u093e\u0964",
    "\u092a\u094d\u0930\u0926\u0942\u0937\u093f\u0924 \u0928\u0926\u093f\u092f\u094b\u0902 \u092e\u0947\u0902 \u0938\u0947 \u090f\u0915 \u0939\u0948\u0926\u0930\u093e\u092c\u093e\u0926 \u0915\u0940 '\u092e\u0942\u0938\u0940 \u0928\u0926\u0940\u0964",
    "\u0905\u0916\u093f\u0932\u0947\u0936 \u091c\u094b \u0915\u0930 \u0930\u0939\u093e \u0939\u0948 \u0909\u0938\u0947 \u0915\u0930\u0928\u0947 \u0926\u094b\u0964",
    "\u0917\u094c\u0930\u0924\u0932\u092c \u0939\u0948 \u0915\u093f \u092e\u0902\u0917\u0932\u0935\u093e\u0930 \u0915\u094b \u0905\u0928\u094d\u0928\u093e \u0915\u0947 \u0905\u0928\u0936\u0928 \u0915\u093e \u0924\u0940\u0938\u0930\u093e \u0926\u093f\u0928 \u0939\u0948\u0964"
  ],
  "use_gst": false, "gst": null, "gst_style_input": null,
  "num_speakers": 1, "num_chars": 127, "r": 2, "gradual_training": null,
  "memory_size": -1, "prenet_type": "original", "prenet_dropout": true,
  "prenet_dropout_at_inference": false, "stopnet": true, "separate_stopnet": true,
  "stopnet_pos_weight": 10.0, "max_decoder_steps": 500,
  "encoder_in_features": 512, "decoder_in_features": 512,
  "decoder_output_dim": 80, "out_channels": 80,
  "attention_type": "graves", "attention_heads": 4, "attention_norm": "sigmoid",
  "attention_win": false, "windowing": false, "use_forward_attn": false,
  "forward_attn_mask": false, "transition_agent": false, "location_attn": true,
  "bidirectional_decoder": false, "double_decoder_consistency": true, "ddc_r": 6,
  "use_speaker_embedding": false, "speaker_embedding_dim": 512,
  "use_d_vector_file": false, "d_vector_file": false, "d_vector_dim": null,
  "lr": 0.0001, "grad_clip": 5.0, "seq_len_norm": false,
  "decoder_loss_alpha": 0.25, "postnet_loss_alpha": 0.25,
  "postnet_diff_spec_alpha": 0.25, "decoder_diff_spec_alpha": 0.25,
  "decoder_ssim_alpha": 0.25, "postnet_ssim_alpha": 0.25, "ga_alpha": 5.0
}
```