NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

Clarification on the inclusion of Multispeaker training dataset during training. #838

Closed ArEnSc closed 3 years ago

ArEnSc commented 3 years ago

@alancucki I read the paper and got some early results with conditioning, but the model is overfitting. I am wondering if that's because, as mentioned in your paper, you trained on LJSpeech WITH two extra speakers' datasets.

Right now I am using the pretrained 200518 model with an extra speaker, i.e. not training with LJSpeech. Would that cause overfitting, since the model is not being conditioned on LJSpeech while being exposed to the new speaker?

That is, do I need to train with LJSpeech plus the conditioned speaker, even with the pretrained model, to obtain good results and prevent overfitting?

Thank you!

ArEnSc commented 3 years ago

So I trained another model, this time for 1000 epochs, and it converged at a mel train loss of 0.10 and a mel validation loss of 1.82.

This is the result I am seeing; I wonder if you have ever encountered this. Using the multispeaker inference, I get the following text -> audio results:

"Here is a ball" -> "Here is a ball"
"The ball is here" -> "The ball is her"

ArEnSc commented 3 years ago

Here's my guess, @alancucki: the alignments provided by the pretrained Tacotron2 model are not suitable for the multispeaker training, and I might need to actually use a forced aligner and generate my own alignment data independent of Tacotron2 - let me know if that makes sense. I would also handle the grapheme -> phoneme conversion somewhere along the alignment data processing stage.

jmasterx commented 3 years ago

@ArEnSc if you end up making something that can extract alignments with the Montreal Forced Aligner or similar, I would really appreciate it if you shared it :)

Also, have you tried continuing to train the pretrained Tacotron 2 model on your dataset so it can adjust to the alignments of your dataset?

alancucki commented 3 years ago

Hi @ArEnSc ,

about overfitting: validation loss is a poor proxy for overfitting; it typically goes up even though the perceived quality improves. Conversely, you can easily bring it down with aggressive regularization while the quality drops. Try monitoring quality by comparing subsequent checkpoints (e.g., every 100-200 epochs) to get a feel for whether the model really overfits.

> do I need to train with LJSpeech + conditioned speaker even with the pretrained model

I'm not sure if I understood the question correctly. If you're asking about fine-tuning a pre-trained model with additional speakers, then it is likely to work. The outcome depends on how much data you have and how different (in gender/expressivity/language etc.) the additional speakers are from LJSpeech.

For artifacts at the end of an utterance, try appending a period during inference, e.g., "The ball is here." The model trained on LJSpeech expects this punctuation.
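A tiny, hypothetical helper for that (not part of the repo) could be:

```python
def ensure_terminal_punctuation(text: str) -> str:
    """Append a period so utterance endings are not cut off at inference."""
    text = text.strip()
    return text if text and text[-1] in ".!?" else text + "."

ensure_terminal_punctuation("The ball is here")  # -> "The ball is here."
```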

As for the alignments, you'd need a Tacotron2 checkpoint trained on your dataset to extract durations. This method works poorly on data unseen by Tacotron2 during training. Note that fine-tuning Tacotron2 might suffice.
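For reference, durations are typically pulled out of a Tacotron2 checkpoint by hardening its attention: each mel frame is assigned to the input symbol it attends to most, and the durations are the resulting frame counts. Below is a minimal sketch of that idea, assuming you already have the soft alignment matrix from your (fine-tuned) checkpoint; the repo's actual extraction code may differ.

```python
import torch

def durations_from_attention(attn: torch.Tensor) -> torch.Tensor:
    """Convert a Tacotron2 attention matrix into per-symbol durations.

    attn: (decoder_frames, encoder_steps) soft alignment from a trained
    Tacotron2 checkpoint.
    """
    decoder_frames, encoder_steps = attn.shape
    # Hard alignment: index of the most-attended input symbol per mel frame.
    argmax_per_frame = attn.argmax(dim=1)                     # (decoder_frames,)
    # Count how many mel frames were assigned to each input symbol.
    durations = torch.zeros(encoder_steps, dtype=torch.long)
    durations.scatter_add_(0, argmax_per_frame,
                           torch.ones_like(argmax_per_frame))
    # Sanity check: durations must sum to the number of mel frames.
    assert durations.sum().item() == decoder_frames
    return durations
```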

@ArEnSc @jmasterx Using phone-level MFA alignments is pretty straightforward. One issue you might run into is the absence of punctuation in MFA output - it will only recognize phonemes or <sil>. You'd need to alter the text cleaners to map punctuation and spaces to silence - either by combining punctuation marks with the adjacent spaces into meta-symbols (e.g., ", " or "; ") or by mapping <sil> to space and embedding punctuation marks with duration 0. Whatever the approach, it's important to keep punctuation in the input data.
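To make the second option concrete, a rough sketch of such a post-processing step might look like the following. The symbol names, the MFA output format, and the naive in-order pairing of punctuation marks with silences (which assumes every punctuation mark coincides with a pause) are assumptions for illustration, not the repo's actual cleaner code.

```python
# Sketch: re-insert punctuation after MFA alignment (illustrative names).
PUNCTUATION = set(",.;:!?")

def merge_punctuation(phones, durations, transcript):
    """Map '<sil>' to a space symbol and re-insert punctuation with duration 0."""
    out_symbols, out_durations = [], []
    # Punctuation marks from the original, punctuated transcript, in order.
    punct_iter = iter(ch for ch in transcript if ch in PUNCTUATION)
    for phone, dur in zip(phones, durations):
        if phone == "<sil>":
            # Attach the next punctuation mark (if any) as a zero-duration
            # symbol, then keep the silence itself as a space.
            punct = next(punct_iter, None)
            if punct is not None:
                out_symbols.append(punct)
                out_durations.append(0)
            out_symbols.append(" ")
            out_durations.append(dur)
        else:
            out_symbols.append(phone)
            out_durations.append(dur)
    return out_symbols, out_durations
```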

jmasterx commented 3 years ago

Using the AsIdeas/ForwardTacotron repo, which in many ways resembles FastPitch, I was able to get very good results by training a Tacotron2 model on lots of data, then continuing training on a tiny dataset, and then extracting alignments.

However, the key difference is that in that repository the Tacotron 2 decoder frames per step (the reduction factor R) can be configured, so the training schedule gradually goes from 10 frames per step down to 1, and this really helps it learn alignments fast. This repository only supports R=1, but I will be testing the same process.

I trained a FastSpeech2 model using MFA alignments, and indeed the punctuation needs to be filled back in, but I felt this hurt the quality of the model.

For the missing endings, I also used a trick with ForwardTacotron that was even more of a hack; I would do:

"The ball is here..." and this got it right every time ;)

ArEnSc commented 3 years ago

@alancucki Thanks for the response, I will give your suggestions a shot and follow up.

ArEnSc commented 3 years ago

> @ArEnSc if you end up making something that can extract alignments with the Montreal Forced Aligner or similar, I would really appreciate it if you shared it :)
>
> Also, have you tried continuing to train the pretrained Tacotron 2 model on your dataset so it can adjust to the alignments of your dataset?

I didn't try that just yet, and yeah, no worries - if I do I'll let you know. https://discord.gg/bFFAugdW is where I usually hang out if you want to chat about this stuff; other people are there as well.