MycroftAI / mimic2

Text to Speech engine based on the Tacotron architecture, initially implemented by Keith Ito.
Apache License 2.0

Other considerations for what makes a good TTS dataset? #26

Open wanshun123 opened 5 years ago

wanshun123 commented 5 years ago

I have done a lot of training on different self-made datasets (typically around 3 hours of audio across a few thousand .wav files, all 22050 Hz) using Tacotron, starting from a pretrained LJSpeech model, with the same hyperparameters and a similar number of steps each time. I am very confused why, for some datasets, the output audio ends up very clear for many samples - sometimes even indistinguishable from the actual person speaking - while for other datasets the synthesised audio always has choppy aberrations. In all my datasets there is no beginning/ending silence, the transcriptions are all correct, and the datasets have fairly similar phoneme distributions (and similar character-length graphs) according to analyze.py in this repo (thanks for making that, by the way).
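For reference, here is a minimal sketch of the kind of dataset sanity checks described above. The directory layout, file pattern, and the -40 dB trim threshold are assumptions rather than anything prescribed by mimic2; it just verifies the sample rate and measures how much leading/trailing silence librosa would trim.

```python
# Hypothetical sanity checks for a small TTS dataset (paths and threshold are assumptions).
import glob
import librosa

WAV_DIR = "my_dataset/wavs"   # assumed layout; adjust to your dataset
EXPECTED_SR = 22050

total_sec = 0.0
edge_silence_sec = 0.0
for path in sorted(glob.glob(f"{WAV_DIR}/*.wav")):
    y, sr = librosa.load(path, sr=None)             # keep the native sample rate
    if sr != EXPECTED_SR:
        print(f"{path}: unexpected sample rate {sr}")
    dur = len(y) / sr
    y_trim, _ = librosa.effects.trim(y, top_db=40)  # trim edge audio below -40 dB
    total_sec += dur
    edge_silence_sec += dur - len(y_trim) / sr

print(f"total audio: {total_sec / 3600:.2f} h, edge silence: {edge_silence_sec:.1f} s")
```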

To take an example from publicly available datasets: on https://keithito.github.io/audio-samples/ one can hear that the model trained on the Nancy Corpus sounds significantly less robotic and clearer than the model trained on LJ Speech. Here https://syang1993.github.io/gst-tacotron/ are samples from a model trained on Blizzard 2013 with Tacotron, with extremely good quality compared to any samples I've heard from a model trained on LJ Speech using Tacotron, even though the Blizzard 2013 data used there is smaller than LJ Speech. Why might this be?

Any comments appreciated.

el-tocino commented 5 years ago

This echoes some of what you've noticed: https://www.reddit.com/r/MachineLearning/comments/a90u3t/d_what_makes_a_good_texttospeech_dataset/ (see the comment by erogol). Also of note: using a neural vocoder such as WaveGlow or WaveRNN can help smooth things out as well.
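As a rough illustration of what swapping in a neural vocoder looks like, the sketch below loads NVIDIA's pretrained WaveGlow from torch.hub and runs it on a mel spectrogram tensor. This is not part of mimic2, and mel parameters must match the acoustic model that produced them, so it will not plug directly into mimic2's Griffin-Lim pipeline; the random tensor is only a stand-in for real Tacotron output.

```python
import torch

# Illustrative only: pretrained WaveGlow vocoder published by NVIDIA on torch.hub,
# not part of mimic2. The mel features here are a random stand-in for the
# (batch, n_mels, frames) output an acoustic model would produce.
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp32')
waveglow = waveglow.remove_weightnorm(waveglow).eval()

mel = torch.randn(1, 80, 300)        # stand-in mel spectrogram
with torch.no_grad():
    audio = waveglow.infer(mel)      # mel -> waveform (22050 Hz for this model)
print(audio.shape)
```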