as-ideas / ForwardTacotron

⏩ Generating speech in a single forward pass without any attention!
https://as-ideas.github.io/ForwardTacotron/
MIT License

Is it possible to pretrain on a different speaker? #20

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hi,

ForwardTacotron works surprisingly well on long articles, which is very welcome. With a good dataset and a lot of training, it is incredible how robust it can get.

Two questions:

Thank you for all this work, it is just incredible how well it performs on whole pages of books!

cschaefer26 commented 4 years ago

Hi, nice to hear. To the questions:

  1. Since we do not have a speaker embedding implemented, it would be necessary to prepare both datasets using separate Tacotrons, one trained on each. I have found, though, that it is possible to fine-tune a Tacotron that has already built up attention on a different dataset (useful in case the second dataset is too small for the model to build attention on its own). Once you have both datasets prepared, you could try to train ForwardTacotron on the first dataset and then fine-tune it on the second one - I never tried that though, so no idea if it helps...

  2. I have found it much easier to train a RAW model than a MOL one, i.e. the MOL models usually show large fluctuations in quality and need to be cherry-picked quite carefully. Personally, I could not really hear much difference between the two anyway. In my experience the shakiness is largely a matter of cherry-picking the model (you could look through the top 5 models or so in tensorboard). Also, I found that the shakiness often shows up on unseen or ambiguous words.
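The warm-start idea in point 1 can be sketched in plain PyTorch. This is a minimal, hypothetical example, not ForwardTacotron's actual training code: the `SimpleTTS` module, the checkpoint filename, and the learning rate are all stand-ins for illustration.

```python
# Hypothetical sketch of warm-starting on a second speaker's dataset.
# SimpleTTS, the checkpoint path and the hyperparameters are illustrative,
# not ForwardTacotron's real API.
import torch
import torch.nn as nn

class SimpleTTS(nn.Module):
    """Stand-in for a ForwardTacotron-style text-to-mel model."""
    def __init__(self, vocab_size=50, mel_dims=80):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 64)
        self.proj = nn.Linear(64, mel_dims)

    def forward(self, x):
        return self.proj(self.embedding(x))

# 1) Train the model on the first, larger dataset, then save a checkpoint.
model = SimpleTTS()
torch.save({'model': model.state_dict()}, 'speaker_a.pt')

# 2) Restore those weights into a fresh instance and continue
#    training on the second speaker's (text, mel) pairs.
finetuned = SimpleTTS()
checkpoint = torch.load('speaker_a.pt')
finetuned.load_state_dict(checkpoint['model'])

# A lower learning rate is typical when fine-tuning on a small dataset.
optim = torch.optim.Adam(finetuned.parameters(), lr=1e-5)
# ... fine-tuning loop over speaker B's data would go here.
```

The key point is only that the second run starts from the first run's weights instead of a random init; whether that transfers speaker characteristics well is, as noted above, untested.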
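The cherry-picking in point 2 amounts to ranking saved checkpoints by a validation metric and auditioning the best few. A small sketch, with made-up loss values standing in for numbers you would read off tensorboard:

```python
# Hypothetical sketch: rank checkpoints by validation loss so the best
# few can be listened to side by side. The loss values are made up.

def top_checkpoints(val_losses, k=5):
    """Return the k (step, loss) pairs with the lowest validation loss."""
    return sorted(val_losses.items(), key=lambda item: item[1])[:k]

# step -> validation loss, e.g. read from tensorboard
losses = {10_000: 0.52, 20_000: 0.41, 30_000: 0.38, 40_000: 0.40, 50_000: 0.39}

best = top_checkpoints(losses, k=3)
# best -> [(30000, 0.38), (50000, 0.39), (40000, 0.40)]
```

Since validation loss does not always track perceived audio quality, the final pick among the top few is usually done by ear, which is the cherry-picking described above.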