antontc opened this issue 6 years ago
Hi @antontc,
this is not a good result, I bet you can make your models sound much better even if your corpus is really small.
One thing that I found useful for training with low resources (e.g. building an English TTS):
Keep in mind that you also have to toy around with the hparams to find the best config for your case. In addition, I think WaveNet needs to be trained longer than 130k steps even if you are using mulaw-quantize (but someone might correct me here if I'm wrong).
PS: It is also useful to synthesize some eval sentences with Griffin-Lim (G&L) before training your WaveNet, just to make sure your Tacotron model has converged successfully. G&L wavs should be quite close to your desired output, but with some metallic-robotic style.
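For context, the G&L check mentioned above is just Griffin-Lim phase reconstruction from the predicted magnitude spectrogram. A minimal sketch using scipy's STFT follows; the window/hop sizes here are illustrative, not the repo's actual hparams, and a real check would feed in Tacotron's predicted spectrograms rather than a random signal:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=512, noverlap=384, seed=0):
    """Recover a waveform from a magnitude spectrogram (freq x frames)
    by iteratively re-estimating phase (Griffin-Lim)."""
    rng = np.random.default_rng(seed)
    # Start from random phase with unit magnitude
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        # Invert to time domain, then re-analyze to get a consistent phase
        _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
        _, _, S = stft(x, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(S))
    _, x = istft(mag * phase, nperseg=nperseg, noverlap=noverlap)
    return x

# Demo: analyze a random signal, throw away the phase, reconstruct
rng = np.random.default_rng(1)
x0 = rng.normal(size=2048)
_, _, S0 = stft(x0, nperseg=512, noverlap=384)
y = griffin_lim(np.abs(S0), n_iter=8)
```

The metallic-robotic artifacts alexdemartos describes come exactly from this phase estimation step, which is why a neural vocoder like WaveNet improves on G&L output.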
@alexdemartos Hello, thank you for your reply, it helps a lot. Does fine-tuning just mean training on my dataset for 150k steps? No extra settings?
@cliuxinxin by fine-tune I mean to train a few extra steps an already existing (trained) model with different/new data.
In this case, what I meant is to:
In my case, I found 10k-20k iters to be enough for the model to switch from the LJSpeech female voice to a male english voice (0.8h training data)
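The recipe above (pretrain on a large corpus such as LJSpeech, then run a short fine-tune on the small target set) can be illustrated with a toy numpy example. The linear model, step counts, and learning rates below are stand-ins for the idea only, not the actual Tacotron/WaveNet training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(w, X, y, steps, lr=0.02):
    # Plain per-sample least-squares SGD: w <- w - lr * grad
    for _ in range(steps):
        i = rng.integers(len(X))
        w = w - lr * (X[i] @ w - y[i]) * X[i]
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Stand-in for the big source corpus (e.g. LJSpeech): lots of data, "voice" w_src
X_big = rng.normal(size=(5000, 8))
w_src = rng.normal(size=8)
y_big = X_big @ w_src

# Stand-in for the small target corpus (~0.8h): few samples, shifted "voice"
X_small = rng.normal(size=(200, 8))
y_small = X_small @ (w_src + 0.3 * rng.normal(size=8))

w_pre = sgd(np.zeros(8), X_big, y_big, steps=20000)  # pretrain on big corpus
w_ft = sgd(w_pre, X_small, y_small, steps=2000)      # short fine-tune on target
```

The point mirrors the thread: starting fine-tuning from the pretrained weights needs only a small fraction of the original training steps to fit the new target, whereas training from scratch on 200 samples of a hard task would generalize poorly.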
@alexdemartos Thank you for your reply. It really helps a lot and gives me a big clue.
@alexdemartos You said "In my case, I found 10k-20k iters to be enough for the model to switch from the LJSpeech female voice to a male english voice (0.8h training data)". Did you retrain Tacotron, WaveNet, or both models? And if I understand correctly, I should train Tacotron on some large dataset, but train WaveNet on my small dataset?
@alexdemartos Does it really work?
Could you please post some output samples?
Your method is very attractive. I'm looking for a way to train Tacotron with as little training data as possible, but 1 hour of training data results in meaningless audio (though the voice itself is acceptable).
That works pretty well and is more or less just "regular" transfer learning, even with only a couple hundred sentences. I tried it briefly with this repo + r9y9's WaveNet (train both on LJ, then both a bit more on the target data). Just recently published a quick paper on that: https://isca-speech.org/archive/Interspeech_2018/abstracts/1316.html although unfortunately not for Taco+WaveNet ;)
Hello. I trained Tacotron and WaveNet on a small dataset: Tacotron for about 100k steps, WaveNet for about 130k. I use mulaw-quantize and GTA. Is this a good result for a small dataset, or do I just need to train more? Or are there just some bugs in the code? wavenet-audio-mel-17.wav.zip
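For readers unfamiliar with the mulaw-quantize setting mentioned in the question: it refers to µ-law companding the waveform into 256 discrete classes, which WaveNet then predicts with a categorical output. A minimal numpy sketch of the transform (not the repo's exact implementation) looks like this:

```python
import numpy as np

def mulaw_quantize(x, mu=255):
    """Compress x in [-1, 1] with mu-law, then map to integer classes 0..mu."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def inv_mulaw(q, mu=255):
    """Map classes 0..mu back to [-1, 1] and expand the mu-law compression."""
    y = 2 * q.astype(np.float64) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

# Round-trip a test signal through the 256-class quantizer
x = np.linspace(-1.0, 1.0, 101)
q = mulaw_quantize(x)
x_rec = inv_mulaw(q)
```

The logarithmic companding spends more of the 256 levels on small amplitudes, which is why 8-bit µ-law targets sound much cleaner than naive 8-bit linear quantization at the same class count.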