Hi, nice to hear. To your questions:
Since we do not have speaker embeddings implemented, you would need to prepare both datasets with separate Tacotrons trained on each. I have found, though, that it is possible to fine-tune a Tacotron model that has already built up attention on a different dataset (useful in case the second dataset is too small for the model to build attention on its own). Once you have prepared both datasets, you could try training ForwardTacotron on the first dataset and then fine-tuning it on the second; I have never tried that myself, though, so I can't say whether it helps.
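For the fine-tuning step, here is a minimal PyTorch sketch of the restore pattern; the model class, file names, and layer sizes are stand-ins to illustrate the idea, not the repo's exact API:

```python
import torch
import torch.nn as nn

class TinyTacotron(nn.Module):
    """Stand-in for the real Tacotron, just to illustrate the pattern."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(256, 256, batch_first=True)
        self.decoder = nn.GRU(256, 80, batch_first=True)

    def forward(self, x):
        h, _ = self.encoder(x)
        mel, _ = self.decoder(h)
        return mel

# 1) Train on the first (larger) corpus and save the weights.
model_a = TinyTacotron()
torch.save(model_a.state_dict(), 'tacotron_speaker_a.pyt')

# 2) Construct the same architecture, restore the pretrained weights,
#    then continue the usual training loop on the new speaker's data.
#    strict=False skips any layers whose shapes don't match.
model_b = TinyTacotron()
model_b.load_state_dict(torch.load('tacotron_speaker_a.pyt'), strict=False)
```

Using a reduced learning rate for the fine-tuning run should also help preserve the attention the model has already built up.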
I have found it much easier to train a RAW model than a MOL one, i.e. the MOL models usually show large fluctuations in quality and need to be cherry-picked quite carefully. Personally, I could not hear much difference between the two anyway. In my experience the shakiness is largely a matter of cherry-picking the model (you could listen through the top 5 models or so in TensorBoard). Also, I found that the shakiness is often present for unseen or ambiguous words.
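For the cherry-picking, something like the following can help audition the last few checkpoints side by side; the checkpoint directory is illustrative and the evaluation body is a stub that you would replace with your actual generation call, then compare the resulting WAVs by ear:

```python
from pathlib import Path
import torch

def audition_checkpoint(ckpt: Path) -> None:
    """Stub: load a vocoder checkpoint so the corresponding model can
    synthesize a fixed test sentence. Replace the print with your
    actual generation call and save the audio for A/B listening."""
    state = torch.load(ckpt, map_location='cpu')
    print(f'{ckpt.name}: {len(state)} entries loaded')

# Audition the five most recent checkpoints instead of blindly
# taking the last one (directory name is an assumption).
for ckpt in sorted(Path('checkpoints/wavernn').glob('*.pyt'))[-5:]:
    audition_checkpoint(ckpt)
```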
Hi,
ForwardTacotron works very well on long articles; it is surprising and very welcome. With a good dataset and a lot of training, it is incredible how robust it can get.
Two questions:
I am interested in making a TTS for a specific voice; however, I also have a smaller corpus from a different speaker, which is very good in terms of vocabulary and evenness. In other implementations (Mozilla TTS, etc.) it is easy to pretrain the TTS on one corpus and then restore that model when kickstarting training for a new speaker. In ForwardTacotron I am not really sure how to go about this, or whether it is even possible. Should I train Tacotron on the first speaker, or ForwardTacotron? I would like to make use of the first, smaller corpus, because I think it would greatly help with the encoder.
Regarding WaveRNN, is it true that RAW yields better results than MOL when it comes to sound quality? I tried training it on a studio dataset and got good results using RAW after 900k steps; however, one can still hear some parts where the voice is "shakier". Do you think I might achieve better results using MOL? The dataset has no background noise, is sampled at 22050 Hz, and is monophonic. :)
Thank you for all this work; it is just incredible how well it performs on whole pages of books!