Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Is this a good result? #200

Open antontc opened 6 years ago

antontc commented 6 years ago

Hello. I trained Tacotron and WaveNet on a small dataset. Tacotron is at about 100k steps, WaveNet at about 130k. I use mulaw-quantize and GTA. Is this a good result for a small dataset, or do I just need to train more? Or is it just some bug in the code? wavenet-audio-mel-17.wav.zip wavenet-waveplot-mel-17
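(For readers unfamiliar with the mulaw-quantize mode mentioned above: it compands each sample with the mu-law curve and quantizes it to 256 classes, so WaveNet can predict a categorical distribution instead of raw amplitudes. A minimal sketch in plain Python; the function names are illustrative, not the repo's actual API:)

```python
import math

MU = 255  # mu-law companding parameter; 256 quantization classes, as in mulaw-quantize

def mulaw_encode(x, mu=MU):
    """Compand a sample x in [-1, 1] and quantize it to an integer class in [0, mu]."""
    fx = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((fx + 1) / 2 * mu + 0.5)

def mulaw_decode(y, mu=MU):
    """Map a quantized class in [0, mu] back to a float sample in [-1, 1]."""
    fx = 2 * y / mu - 1
    return math.copysign(math.expm1(abs(fx) * math.log1p(mu)) / mu, fx)
```

The companding step allocates more of the 256 classes to quiet samples, which is why 8-bit mu-law audio sounds far better than naive 8-bit linear quantization.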

alexdemartos commented 6 years ago

Hi @antontc,

this is not a good result; I bet you can make your models sound much better even if your corpus is really small.

One thing that I found useful for training with low resources (e.g. building an English TTS) is fine-tuning a model pre-trained on a larger dataset.

Keep in mind that you also have to toy around with the hparams to find the best config for your case. In addition, I think WaveNet needs to be trained for longer than 130k steps even if you are using mulaw-quantize (but someone may correct me here if I'm wrong).

PS: It is also useful to synthesize some eval sentences with Griffin-Lim (G&L) before training your WaveNet, just to make sure your Tacotron model has converged successfully. G&L wavs should be quite close to your desired output, but with a somewhat metallic, robotic quality.

cliuxinxin commented 6 years ago

@alexdemartos Hello, thank you for your reply, it helps a lot. Does fine-tuning just mean training on my dataset for 150k steps? No extra settings?

alexdemartos commented 6 years ago

@cliuxinxin by fine-tune I mean training an already existing (trained) model for a few extra steps with different/new data.

In this case, what I meant is to:

  1. Train a new (from scratch) Tacotron model for ~200k steps with a large dataset.
  2. Train the resulting model from step 1 with your low-resource data for a number of steps that depends on the size of your dataset (to avoid overfitting).

In my case, I found 10k-20k iterations to be enough for the model to switch from the LJSpeech female voice to a male English voice (0.8 h of training data).
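(The two-step recipe is plain transfer learning: start step 2 from the step-1 checkpoint instead of from random initialization. A toy illustration of the mechanics, with a hypothetical 1-D linear model standing in for Tacotron and plain SGD for the real optimizer; none of this is the repo's code:)

```python
import random

def train(data, w=0.0, b=0.0, steps=1000, lr=0.05):
    """Minimal SGD on a 1-D linear model y = w*x + b (squared error)."""
    for _ in range(steps):
        x, y = random.choice(data)
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b

random.seed(0)
# Step 1: a "large dataset" from one speaker (slope 2, intercept 0 plays the LJSpeech role).
large = [(x / 10, 2 * (x / 10)) for x in range(-50, 50)]
w, b = train(large)  # train from scratch, many steps

# Step 2: fine-tune the *same* weights on a small "target speaker" set (slope 2, intercept 1).
small = [(x / 10, 2 * (x / 10) + 1) for x in range(-5, 5)]
w_ft, b_ft = train(small, w=w, b=b, steps=200, lr=0.05)
```

Because the fine-tune run inherits most of its "knowledge" (here, the slope) from step 1, it only needs a few steps and little data to adapt the speaker-specific part (here, the intercept), which is the same intuition behind the 10k-20k iteration figure above.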

cliuxinxin commented 6 years ago

@alexdemartos Thank you for your reply. It really helps a lot and gives me a big clue.

antontc commented 6 years ago

@alexdemartos You said: "In my case, I found 10k-20k iters to be enough for the model to switch from the LJSpeech female voice to a male english voice (0.8h training data)". Did you retrain Tacotron, WaveNet, or both models? And if I understand correctly, I should train Tacotron on some large dataset, but train WaveNet on my small dataset?

Thien223 commented 6 years ago

@alexdemartos Does it really work?

Could you please post some output samples?

Your method sounds very attractive. I'm looking for a way to train Tacotron with as little training data as possible, but 1 hour of training data results in meaningless audio (though the voice itself is acceptable).

m-toman commented 6 years ago

That works pretty well and is more or less just "regular" transfer learning, even with only a couple hundred sentences. I tried it briefly with this repo + r9y9's WaveNet (train both on LJ, then train both a bit more on the target data). I just recently published a quick paper on that: https://isca-speech.org/archive/Interspeech_2018/abstracts/1316.html although unfortunately not for Taco+WaveNet ;)