NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[FastPitch 1.1/PyTorch] Advice/best practices for good alignment when fine-tuning #1004

Closed DanRuta closed 2 years ago

DanRuta commented 2 years ago

Related to FastPitch 1.1/PyTorch

Hi @alancucki @rafaelvalle. I've been experimenting with the FastPitch 1.1 update incorporating RAD-TTS alignment into FastPitch, since the commit a while back. The alignment mechanism is amazing, and enables some great things (e.g. speech-to-speech).

However, I'm unfortunately having some issues with the convergence on some datasets with this approach, compared to 1.0. I've successfully converged an LJSpeech model quite well, but when fine-tuning a pre-trained model (like the one provided), it seems that the alignment is having some real trouble converging.

I have tried it on 4 datasets so far - 2 male, 2 female. I did get something ALMOST good with one of the female datasets (~9h), but they all converge to a fairly high KL loss (>=0.85 at 40k-120k iterations, compared to less than 0.35 where I stopped LJ at 25k iterations). I added soft and hard alignment plots to the logs, and they resemble plots c) and d) from Fig. 2 in the RAD-TTS paper. I also noticed in Figure 2 of "One TTS Alignment To Rule Them All" that convergence was slower using RAD-TTS alignment with FastPitch (compared to Tacotron2 durations) before they arrived at a similar point - could this be exacerbated by smaller datasets?

I have tried experimenting with resuming the LJ optimizer (from the model I trained myself) as well as one newly initialized (from the provided LJ model), with and without including the KL weight warm-up stage. I also tried with and without arpabet, with and without energy conditioning, and also several tweaks to lr scheduling, and other such things, but I can never get anything as good as LJ (in the KL loss at least).

When running inference, the sentence composition quality varies between datasets, ranging from missing letters to missing words, and, for the smaller datasets, speech that is quite difficult to understand, spoken very fast.

The same datasets worked very well in the previous Tacotron2+FastPitch set-up, so I'm confident that the data quality is high. Have you by any chance had any successes yourselves with something other than LJ? And would you have any tips/advice for how to better converge the alignment on smaller datasets (with transfer learning)?

Thank you for all your great work!

rafaelvalle commented 2 years ago

FastPitch 2 already has the alignment method used in RAD-TTS, and I think there's a version in NVIDIA's NeMo repo. Adrian, can you point him to the code?

DanRuta commented 2 years ago

Is this different from FastPitch 1.1? I assumed that was the final version where RAD-TTS was added (I didn't add it in myself) - exciting, if I was wrong.

I see it listed in the nemo repo - I will have a look through it today.

I also see a link on ngc (https://ngc.nvidia.com/catalog/models/nvidia:tao:speechsynthesis_english_fastpitch) which mentions FastPitch 2 (btw, the arxiv link points to 1.0), but is this different from the code in this repo?

alancucki commented 2 years ago

Hi @DanRuta! Sorry for the laggy reply.

The same datasets worked very well in the previous Tacotron2+FastPitch set-up

Did those datasets have their own Tacotron 2 models, or did you predict their durations with LJ Tacotron 2?

I have tried it on 4 datasets so far - 2 male, 2 female. I did get something ALMOST good with one of the female datasets

Pretty much all the datasets I've tried worked without issues when training from scratch. When the dataset is smaller, I had good results combining a few speakers and using speaker embeddings (there is support for that in the code already). Note that the KL warm-up schedule should then be re-scaled proportionally.

I assume you're fine-tuning instead of training from scratch due to resource constraints. It's a long shot, but I'd try adding the smaller dataset (say 9 h) to LJSpeech-1.1 and fine-tuning on both of them. The speaker embedding for LJ could initially be a vector of zeros (it's additive anyway). If you have a tiny dataset, you can quickly increase its size relative to LJ by just concatenating its filelist a couple of times. Also, remember that the pitch normalization constants (mean and std) are by default set to those of LJ.
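For concreteness, a rough sketch of that filelist merging (the paths, the pipe-separated wav_path|text layout, and the appended speaker-ID column are assumptions about your setup, not something prescribed here):

```python
# Hypothetical helper: merge LJSpeech with a smaller speaker into one
# multi-speaker filelist. Assumes pipe-separated lines "wav_path|text";
# a speaker-ID column is appended (0 = LJ, 1 = the new speaker).
def merge_filelists(lj_path, small_path, out_path, repeats=4):
    with open(lj_path) as f:
        lj_lines = [line.strip() for line in f if line.strip()]
    with open(small_path) as f:
        small_lines = [line.strip() for line in f if line.strip()]

    merged = [f"{line}|0" for line in lj_lines]
    # Duplicate the small dataset so it isn't drowned out by LJ.
    merged += [f"{line}|1" for line in small_lines * repeats]

    with open(out_path, "w") as f:
        f.write("\n".join(merged) + "\n")

merge_filelists("filelists/ljs_audio_text_train.txt",
                "filelists/new_speaker_train.txt",
                "filelists/combined_train.txt")
```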

Is this different from FastPitch 1.1?

No, I think @rafaelvalle meant v1.1. We've reserved the v2 version number for a more substantial update, but it seems it has already caught on.

btw, thanks for the detailed feedback, it's really valuable! Let me know if adding speakers for fine-tuning fixes things.

DanRuta commented 2 years ago

Hi, not at all! Same here!

Did those datasets have their own Tacotron 2 models, or did you predict their durations with LJ Tacotron 2?

I've been training (fine-tuning) Tacotron2 models for every FastPitch model I've been training (~425 models so far), with success almost every time, even on <=~5min datasets - usually failing only on voices with "temporal distortion" like echoes (expected, due to the architecture).

Note that the KL warm-up schedule should then be re-scaled proportionally.

I did notice that changing with dataset sizes. I've been duplicating dataset lines to roughly match the number of lines LJ has.

I've been messing with the code a bit more since, and I think the issue was somewhere in the data pre-processing step. I had to change a couple of small bits to have things run on Windows, and I likely inadvertently broke something in the process. I've fixed it, and things seem to be working great now! (~34k frames/sec on a 3090 and 5950X, with AMP and batch size 32; ~50k frames/sec on LJ, somehow). I've tried a few male and female voices, just to check a variety of voice types, and they seem to be working so far.

Apologies for the confusion. I've tweaked a couple of things to allow non-binary arpabet probability, and I'm now training models with 0.5 prob of text/phoneme input. As expected, this seems to work better for mixed-input inference. Speech-to-speech is also working quite well, which is exciting!
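In case it helps anyone reading along, a toy illustration of the per-utterance text/phoneme mixing (this is not the repo's TextProcessing code; the stub dict stands in for a real CMUdict lookup, and the curly-brace convention is how ARPAbet is usually embedded in these text frontends):

```python
import random

# Toy sketch: with probability p_arpabet, swap known words for their
# ARPAbet transcriptions (wrapped in braces), otherwise keep raw text.
CMUDICT_STUB = {"hello": "HH AH0 L OW1", "world": "W ER1 L D"}

def maybe_phonemize(text, p_arpabet=0.5):
    if random.random() >= p_arpabet:
        return text
    out = []
    for word in text.lower().split():
        phones = CMUDICT_STUB.get(word)
        out.append("{" + phones + "}" if phones else word)
    return " ".join(out)

print(maybe_phonemize("hello world"))  # raw text or "{HH AH0 L OW1} {W ER1 L D}"
```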

I did have a question. With 1.0, the pitch mean/std values were computed from the Tacotron durations, but this is no longer possible, as there are no durations to use at this point. I've scripted it into the dataloader during data pre-processing (using pyin) for whole-sequence mean/std, but the values are quite different from before. The LJ default values in 1.1 are somewhat similar to the values in 1.0, so I'm wondering if you also used the same process.
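Roughly what I mean (a sketch with librosa's pYIN, assuming 22050 Hz audio and stats taken over voiced frames only; not necessarily the exact recipe behind the LJ defaults):

```python
import numpy as np
import librosa

def dataset_pitch_stats(wav_paths, sr=22050):
    """Compute dataset-wide pitch mean/std over voiced frames with pYIN."""
    voiced_f0 = []
    for path in wav_paths:
        y, _ = librosa.load(path, sr=sr)
        f0, voiced_flag, _ = librosa.pyin(
            y,
            fmin=librosa.note_to_hz("C2"),
            fmax=librosa.note_to_hz("C7"),
            sr=sr)
        voiced_f0.append(f0[voiced_flag])  # drop unvoiced (NaN) frames
    f0_all = np.concatenate(voiced_f0)
    return float(f0_all.mean()), float(f0_all.std())
```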

Finally, I know with 1.0 that there was no recommended way to do early stopping, but rather to intermittently check the quality in inference once the training loss flattens. Is this still the case?

Thank you again for the help (and my apologies for the previous issue!).

Jcwscience commented 2 years ago

@DanRuta Is there any documentation on how to fine tune FastPitch from a pretrained model?

alancucki commented 2 years ago

@DanRuta , that is quite a collection of Tacotron and FastPitch models :)

I've tweaked a couple of things to allow non-binary arpabet probability, and I'm now training models with 0.5 prob of text/phoneme input.

Does it work equally well with FastP 1.0 and 1.1?

The LJ default values in 1.1 are somewhat similar to the values in 1.0, so I'm wondering if you also used the same process.

tbh I haven't seen much difference when tweaking the mean/std values. I guess getting the normalization right would be more important with a multi-speaker model.

Finally, I know with 1.0 that there was no recommended way to do early stopping, but rather to intermittently check the quality in inference once the training loss flattens. Is this still the case?

Unfortunately yes. And it is a bit tricky - I have a strong feeling that, when you're trying to do early stopping, the optimal epoch depends on which vocoder will be applied later on.

alancucki commented 2 years ago

@Jcwscience We don't provide any docs for fine-tuning, but it boils down to loading model weights and carefully preparing the data.
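Very roughly, the weight-loading part looks something like this (a minimal sketch; the 'state_dict' key and the strict=False handling are assumptions to verify against the checkpoint you actually use):

```python
import torch

def warm_start(model, ckpt_path):
    """Load pretrained FastPitch weights into an already-built model."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)
    # strict=False tolerates mismatches, e.g. a resized speaker embedding.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print("missing keys:", missing)
    print("unexpected keys:", unexpected)
    return model
```

After that it is regular training on the new filelists, typically with a lower learning rate and the pitch normalization constants recomputed for the new speaker.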

Jcwscience commented 2 years ago

@alancucki Hmm. I'm not sure my knowledge will hold up to that task yet. I may have to go back to Tacotron 2 for now; I have acquired more powerful hardware, so the memory issue is no longer a factor. That, or perhaps Flowtron. However, I have been extremely confused about how Flowtron works, since the script doesn't seem to find the correct set of speaker IDs. I will figure it out eventually, I'm sure. Thanks for the feedback!

DanRuta commented 2 years ago

@alancucki

I haven't actually tried ARPAbet with FastPitch 1.0, but it does work very well with 1.1! Overall better quality than the 1.0 models for text-only input too, despite having trained with a 0.5 probability of using phonemes.

Much better, in fact, for smaller datasets especially, though I did split the training into 3 stages (alignment, duration predictor, then the rest), on the intuition that not training the pitch/energy/decoder components with bad durations at the start keeps the parameters from being pushed permanently away from a good minimum given the limited data, while still avoiding overfitting. I've not done any rigorous scientific comparisons on this yet, though.
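The staging itself is just toggling requires_grad between phases; a rough sketch of the idea (module names here are illustrative - check the actual attribute names in the FastPitch model class):

```python
def set_stage(model, stage):
    # Stage 1: train only the aligner; stage 2: also the duration
    # predictor; stage 3: unfreeze everything (pitch/energy/decoder).
    groups = {
        1: ["attention"],                       # aligner (illustrative name)
        2: ["attention", "duration_predictor"],
        3: None,                                # everything
    }
    allowed = groups[stage]
    for name, param in model.named_parameters():
        param.requires_grad = (
            allowed is None or any(name.startswith(g) for g in allowed)
        )
```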

Thank you for all the help/advice, everything seems to be working great :) - I've gotten through quite a few very different voices now, all turned out very well.

dsplog commented 2 years ago

@DanRuta : nice and helpful of you to share your observations. Can you please comment on how the model was extended for speech-to-speech synthesis?

longjoke commented 2 years ago

@DanRuta did you train a LJ model with 0.5 probability and then use transfer learning for the smaller datasets or are you using the NVIDIA model?

DanRuta commented 2 years ago

@dsplog No extension needed. Given a transcript for the input audio (ground truth, or ASR-inferred), the RAD-TTS alignment in a FastPitch model trained on a reference voice A can infer durations for every symbol. Those durations can then be used to extract pitch and energy values from the reference audio. Then, using the durations, pitch, and energy values, you can perform partial inference with a FastPitch model for your target voice B. You can use/see an implementation of this here, or see a video of it here.
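In rough pseudocode, the recipe looks like this (align_audio_to_text, extract_pitch and extract_energy are hypothetical helpers, and the infer keyword names may differ from the actual FastPitch signature):

```python
import torch

@torch.no_grad()
def speech_to_speech(model_a, model_b, ref_mel, text_ids):
    # 1. Use voice A's trained aligner to get per-symbol durations
    #    for the reference audio.
    durations = align_audio_to_text(model_a, ref_mel, text_ids)

    # 2. Average frame-level pitch/energy of the reference audio
    #    over each symbol's duration span.
    pitch = extract_pitch(ref_mel, durations)
    energy = extract_energy(ref_mel, durations)

    # 3. Partial inference with voice B: skip its duration/pitch/energy
    #    predictors and feed the reference values instead.
    return model_b.infer(text_ids, dur_tgt=durations,
                         pitch_tgt=pitch, energy_tgt=energy)
```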

@longjoke I fine-tuned the NVIDIA LJ model using 0.5 ARPAbet probability first, before transfer learning on my own datasets.