NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Speech Synthesis (TTS) training in another language #2352

Closed lucalazzaroni closed 2 years ago

lucalazzaroni commented 3 years ago

I'd like to train my own TTS model in Italian, using the Italian portion of the M-AILABS dataset (~18 hours). If I train the Tacotron 2 model with this dataset (16 kHz), can I fine-tune from a pre-existing English model? I think that would improve performance, but I didn't find any tutorial or example showing how to do it. Another question is whether vocoder training is necessary, or whether the English/Chinese pre-trained vocoders give acceptable performance even in a different language such as Italian. My last question is whether the two end-to-end models, FastPitch_HifiGan_E2E and FastSpeech2_HifiGan_E2E, can be trained or fine-tuned from pre-existing checkpoints, avoiding the need to train both a spectrogram generator and a vocoder. Thanks in advance for your answers.

blisc commented 3 years ago

> can I fine-tune from a pre-existing English model?

I'm not aware of any research that has tried this. It might be possible if you use a phoneme-based model, but I'm not sure if we have any such models in NeMo. Maybe TalkNet is phone-based.

> Another question is whether vocoder training is necessary, or whether the English/Chinese pre-trained vocoders give acceptable performance even in a different language such as Italian.

If you plan on using WaveGlow, it should work without any fine-tuning. If you plan on using the GAN vocoders, it would be interesting to see whether they work, as this is untested.

> My last question is whether the two end-to-end models, FastPitch_HifiGan_E2E and FastSpeech2_HifiGan_E2E, can be trained or fine-tuned from pre-existing checkpoints, avoiding the need to train both a spectrogram generator and a vocoder.

Yep, the idea of an end-to-end model is to avoid training two models.

As for fine-tuning in general, MAILABS should have enough data for training from scratch, so I would recommend training from scratch.

lucalazzaroni commented 3 years ago

Thank you for your answers.

> Yep, the idea of an end-to-end model is to avoid training two models.

Is there any tutorial or example that covers training one of the two end-to-end models, FastPitch_HifiGan_E2E or FastSpeech2_HifiGan_E2E? I didn't find anything on them except the code on NGC here. My idea would be to train one of these two with the M-AILABS dataset, or train the Tacotron 2 model if they are not available yet.

blisc commented 3 years ago

> Is there any tutorial or example that covers training one of the two end-to-end models, FastPitch_HifiGan_E2E or FastSpeech2_HifiGan_E2E? I didn't find anything on them except the code on NGC here. My idea would be to train one of these two with the M-AILABS dataset, or train the Tacotron 2 model if they are not available yet.

No, we don't have a notebook for the end-to-end models. They use the same datasets as their respective spectrogram-generator siblings; in effect, all you need is to preprocess the dataset according to those models. I.e., for FastSpeech 2 you would need to use MFA (Montreal Forced Aligner) to get the duration data, and for FastPitch you would have to first train a Tacotron 2 model.
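
For the FastSpeech 2 route, a possible MFA invocation might look like the sketch below; the Italian dictionary and acoustic model paths are placeholders, not files shipped with NeMo.

```bash
# Sketch only: align the Italian M-AILABS corpus with the Montreal Forced
# Aligner to obtain the duration data FastSpeech 2 needs. All paths and the
# dictionary/acoustic model names are placeholders.
mfa align /data/mailabs_it/corpus \
          /data/italian_lexicon.dict \
          /data/italian_acoustic_model.zip \
          /data/mailabs_it/alignments
```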

The new version of FastPitch, `fastpitch_align`, which is available in the main branch, does not require a trained Tacotron 2 model. Unfortunately, this new version has not been ported to the end-to-end model.

lucalazzaroni commented 3 years ago

Hi @blisc, following your advice I'm now training a Tacotron 2 model following the tutorial. Are there any changes I need to make to the config file? At the moment I only changed the labels to match the Italian portion of the M-AILABS dataset, then changed the sample rate to 16000 when launching the tacotron2.py script, but I don't know if more changes are needed. Also, my dataset is about 20 hours long; how many epochs do you suggest setting when launching the script? Thanks again for your help.

blisc commented 3 years ago

> Hi @blisc, following your advice I'm now training a Tacotron 2 model following the tutorial. Are there any changes I need to make to the config file? At the moment I only changed the labels to match the Italian portion of the M-AILABS dataset, then changed the sample rate to 16000 when launching the tacotron2.py script, but I don't know if more changes are needed. Also, my dataset is about 20 hours long; how many epochs do you suggest setting when launching the script? Thanks again for your help.

I think changing the sample rate is sufficient. The Tacotron 2 model we provide was trained for 800 epochs, so that should be enough. For Tacotron 2, as long as your attention map looks diagonal, you can produce audible speech.
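
For reference, a hedged sketch of such a run; the config keys and paths below are assumptions that depend on your NeMo version and dataset layout.

```bash
# Sketch only: launch NeMo's Tacotron 2 example script with Hydra overrides.
# The manifest paths are placeholders and the key names may differ between
# NeMo versions; check examples/tts/conf/tacotron2.yaml for the exact names.
python examples/tts/tacotron2.py \
    sample_rate=16000 \
    train_dataset=/data/mailabs_it/train_manifest.json \
    validation_datasets=/data/mailabs_it/val_manifest.json \
    trainer.max_epochs=800
```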

lucalazzaroni commented 3 years ago

> Hi @blisc, following your advice I'm now training a Tacotron 2 model following the tutorial. Are there any changes I need to make to the config file? At the moment I only changed the labels to match the Italian portion of the M-AILABS dataset, then changed the sample rate to 16000 when launching the tacotron2.py script, but I don't know if more changes are needed. Also, my dataset is about 20 hours long; how many epochs do you suggest setting when launching the script? Thanks again for your help.
>
> I think changing the sample rate is sufficient. The Tacotron 2 model we provide was trained for 800 epochs, so that should be enough. For Tacotron 2, as long as your attention map looks diagonal, you can produce audible speech.

Thank you again. I have another question regarding the vocoder. I suppose all of them are trained at a 22050 Hz sample rate, but my Tacotron 2 model will use 16000 Hz; do you think this could be a problem?

lucalazzaroni commented 3 years ago

Update: I'm currently at epoch 200. The training loss is around 0.25, but the validation loss has stayed between 5 and 9 from the beginning... I suspect some overfitting. Any suggestions? As @blisc suggested, I only changed the sample rate in my config, but the model doesn't seem to be learning, judging by the loss graphs.

blisc commented 3 years ago

Training/evaluation loss for tacotron2 is not really informative. All you care about for tacotron2 is the attention plot. For more info: see https://github.com/NVIDIA/NeMo/issues/282

lucalazzaroni commented 3 years ago

> Training/evaluation loss for tacotron2 is not really informative. All you care about for tacotron2 is the attention plot. For more info: see #282

Sorry @blisc for the stupid question, but how can I obtain the attention plot? My tensorboard logs only give me info about training/validation loss, epochs and learning rate. I used the tacotron2.py script. Also, I didn't find anything on that in the documentation.

blisc commented 3 years ago

You should have an images tab in tensorboard like it is shown here: https://www.tensorflow.org/tensorboard. If you don't have it, try updating your tensorboard.
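
For example, something along these lines (the log directory is a placeholder; point it at wherever your experiment writes its TensorBoard events):

```bash
# Update TensorBoard, then point it at the NeMo experiment directory.
pip install --upgrade tensorboard
tensorboard --logdir ./nemo_experiments
```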

lucalazzaroni commented 3 years ago

Hi @blisc, thank you again for your support. My 800-epoch training on the 16 kHz M-AILABS dataset finished tonight. However, it results in a very deep voice, although the Italian pronunciation is very good. Looking through old issues I found this one, which seems very similar to my problem, but I checked and fmax in my config is set to 8000. The final alignment plot is also very similar. I don't know whether it is because the vocoder was trained at 22050 Hz while my model uses 16000 Hz. In fact, I noticed that if I set the audio output to 22050 Hz, the voice becomes less deep, but the speech is too fast.

[Screenshot attachment: 2021-07-08 at 10:46:34]

OSSome01 commented 3 years ago

> can I fine-tune from a pre-existing English model?
>
> I'm not aware of any research that has tried this. It might be possible if you use a phoneme-based model, but I'm not sure if we have any such models in NeMo. Maybe TalkNet is phone-based.

I am working on a similar project involving a low-resource language, and I was looking at TalkNet to implement it. As far as I understand, TalkNet is a grapheme-based model: it uses a grapheme duration predictor to predict what the duration of each grapheme should be in the output. Since I want to work with a different language, I think converting the input text to graphemes would be quite difficult. Can someone please share how I can proceed? Also, are there any other models with the functionality of TalkNet but which use phonemes as input? Thanks in advance.

blisc commented 3 years ago

> Hi @blisc, thank you again for your support. My 800-epoch training on the 16 kHz M-AILABS dataset finished tonight. However, it results in a very deep voice, although the Italian pronunciation is very good. Looking through old issues I found this one, which seems very similar to my problem, but I checked and fmax in my config is set to 8000. The final alignment plot is also very similar. I don't know whether it is because the vocoder was trained at 22050 Hz while my model uses 16000 Hz. In fact, I noticed that if I set the audio output to 22050 Hz, the voice becomes less deep, but the speech is too fast.

Right, you might have to retrain a vocoder; I would recommend training HiFiGAN. You can try using Griffin-Lim to see whether the problem is the vocoder or something else.
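
A hedged sketch of such a Griffin-Lim sanity check, assuming a trained NeMo Tacotron 2 checkpoint; the checkpoint path, the STFT/mel parameters, and the log/magnitude conventions below are assumptions and must match your training config.

```python
# Sketch only: invert a generated mel spectrogram with Griffin-Lim (librosa)
# to check whether the deep-voice artifact comes from the 22050 Hz vocoder.
import librosa
import numpy as np
import soundfile as sf
from nemo.collections.tts.models import Tacotron2Model

model = Tacotron2Model.restore_from("tacotron2_italian_16khz.nemo")  # placeholder path
model.eval()

tokens = model.parse("Questa è una frase di prova.")
spec = model.generate_spectrogram(tokens=tokens)      # log-mel spectrogram, shape [1, n_mels, T]
mel = np.exp(spec.squeeze().detach().cpu().numpy())   # undo the natural-log scale (assumption)

# n_fft/hop/win/fmax must match the training config; power=1.0 assumes a
# magnitude (not power) mel spectrogram.
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=16000, n_fft=1024, hop_length=256, win_length=1024,
    fmin=0, fmax=8000, power=1.0, n_iter=60,
)
sf.write("griffin_lim_check.wav", audio, 16000)
```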

> I am working on a similar project involving a low-resource language, and I was looking at TalkNet to implement it. As far as I understand, TalkNet is a grapheme-based model: it uses a grapheme duration predictor to predict what the duration of each grapheme should be in the output. Since I want to work with a different language, I think converting the input text to graphemes would be quite difficult. Can someone please share how I can proceed? Also, are there any other models with the functionality of TalkNet but which use phonemes as input? Thanks in advance.

TalkNet should work with our Phones class. The newest TalkNet configs use phones by default: https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/talknet-spect.yaml#L29-L34. Please note that we don't currently support other languages, so you would have to write your own G2P or extend our G2P in NeMo to support your language. You should also switch to IPA, as we are currently using CMUdict and ARPAbet.
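
As an illustration only (this is not NeMo's G2P interface), a toy greedy longest-match mapping from Italian graphemes to IPA could start like the sketch below; a real G2P needs context-dependent rules and a lexicon.

```python
# Toy illustration only, not NeMo's G2P API: a naive longest-match Italian
# grapheme-to-IPA mapping. It ignores context (e.g. "c"/"g" before e/i are
# /tʃ/ and /dʒ/, "sc" before e/i is /ʃ/) and stress.
ITALIAN_G2P = {
    "gli": "ʎ", "gn": "ɲ", "ch": "k", "gh": "ɡ",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "b": "b", "c": "k", "d": "d", "f": "f", "g": "ɡ", "l": "l",
    "m": "m", "n": "n", "p": "p", "r": "r", "s": "s", "t": "t",
    "v": "v", "z": "dz",
}

def naive_g2p(word):
    phones, i = [], 0
    while i < len(word):
        for length in (3, 2, 1):          # prefer the longest matching grapheme
            chunk = word[i:i + length]
            if chunk in ITALIAN_G2P:
                phones.append(ITALIAN_G2P[chunk])
                i += length
                break
        else:
            i += 1                        # no rule for this character: skip it
    return phones

print(naive_g2p("gnocchi"))  # ['ɲ', 'o', 'k', 'k', 'i']
```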

Cris140 commented 2 years ago

How can I implement IPA in TalkNet? I tried training some models in another language, but the output was pretty bad, with a lot of gibberish; I think it's because of ARPAbet.

ireneb612 commented 2 years ago

@lucalazzaroni Would it be possible for you to share your trained Italian model?

lucalazzaroni commented 2 years ago

Hi @ireneb612, unfortunately I cannot share the model, but I can give you some information about it. I trained the MelGAN model for 3000 epochs and the Tacotron 2 model for 1500 to achieve a good result. The main flaws are the dataset, M-AILABS, which is sampled at 16 kHz, and the audio samples, which come from a 1900s audiobook, so the pronunciation is somewhat solemn. I attach an example of the final result (I had to convert it from wav to mp4, since GitHub does not support audio file attachments). Hope this can help you! https://user-images.githubusercontent.com/26137413/146786552-b826ea62-0ec2-4711-a1c6-e96984515e92.mp4
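
For reference, a conversion along these lines works (file names are placeholders):

```bash
# Wrap the wav in an mp4 container with AAC audio so GitHub accepts the attachment
ffmpeg -i tts_sample.wav -c:a aac tts_sample.mp4
```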


ireneb612 commented 2 years ago

@lucalazzaroni did you try any end-to-end model?

lucalazzaroni commented 2 years ago

@ireneb612 No, I only tried Tacotron 2 + MelGAN. I chose MelGAN because, even without training, it seemed the least bad among all the available models.

harrypotter90 commented 2 years ago

@lucalazzaroni: Can you please tell me how much data you used?
And, if I am correct, you used 16 kHz?

Update: Was it a single-speaker dataset? Can we use a multi-speaker training dataset, like Mozilla Common Voice?

lucalazzaroni commented 2 years ago

Hi @harrypotter90, I used the male portion of the M-AILABS dataset, which is around 18 hours, 16 kHz, single speaker. I don't think you can use multi-speaker datasets, since these TTS models are designed for single-voice training. Anyway, I used Tacotron 2 (1500 epochs) and MelGAN (3000 epochs) and obtained good results. The problem lies in the dataset, where the pronunciation of the sentences is too solemn.

harrypotter90 commented 2 years ago

Thank you @lucalazzaroni for the guidance.

Another question: is there a special dataset for the vocoder, or can I use the same dataset to train the MelGAN vocoder?

At the moment I have just trained Tacotron2.

lucalazzaroni commented 2 years ago

Hi @harrypotter90, I used the same dataset for both the vocoder and Tacotron 2, but since the M-AILABS dataset has 0.5 seconds of silence at the beginning and at the end of each file, I removed these parts using ffmpeg. Anyway, if I remember correctly, you should also be able to specify the leading and trailing silence in the config file without having to touch the wav files.
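
For reference, one possible ffmpeg recipe for trimming leading and trailing silence; the -50 dB threshold is an assumption and should be tuned per dataset, and the file names are placeholders.

```bash
# Trim leading silence, reverse, trim the (now leading) trailing silence, reverse back.
ffmpeg -i in.wav -af "silenceremove=start_periods=1:start_threshold=-50dB,areverse,silenceremove=start_periods=1:start_threshold=-50dB,areverse" out.wav
```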