I have trained Tacotron 1 and 2 on a number of self-made datasets (typically around 3 hours of audio across a few thousand .wav files, all at 22050 Hz), each time fine-tuning from a pretrained LJSpeech model with the same hyperparameters and for a similar number of steps. I am very confused about why, for some datasets, the output audio ends up very clear for many samples - sometimes even indistinguishable from the actual person speaking - while for other datasets the synthesised audio always has choppy aberrations. In all my datasets there is no beginning/ending silence, the transcriptions are all correct, and the datasets have fairly similar phoneme distributions.
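To be concrete about what I mean by "no beginning/ending silence" and the 22050 Hz requirement, this is roughly the kind of check I run over each dataset (a minimal librosa sketch; the directory path, `top_db`, and the silence tolerance are placeholders, not my exact script):

```python
# Rough dataset sanity check: flag clips with the wrong sample rate
# or with noticeable silence at either end of the file.
import glob
import librosa

DATASET_DIR = "my_dataset/wavs"   # placeholder path
TARGET_SR = 22050
EDGE_SILENCE_SEC = 0.05           # tolerance for silence at either end

for path in sorted(glob.glob(f"{DATASET_DIR}/*.wav")):
    audio, sr = librosa.load(path, sr=None)  # keep the native sample rate
    if sr != TARGET_SR:
        print(f"{path}: unexpected sample rate {sr}")
    # librosa.effects.trim returns the trimmed signal plus [start, end] sample indices
    _, (start, end) = librosa.effects.trim(audio, top_db=40)
    lead = start / sr
    trail = (len(audio) - end) / sr
    if lead > EDGE_SILENCE_SEC or trail > EDGE_SILENCE_SEC:
        print(f"{path}: {lead:.2f}s leading / {trail:.2f}s trailing silence")
```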
To take an example from publicly available datasets: on https://keithito.github.io/audio-samples/ one can hear that the model trained on the Nancy Corpus sounds significantly less robotic and clearer than the model trained on LJ Speech. Similarly, https://syang1993.github.io/gst-tacotron/ has samples from a Tacotron model trained on Blizzard 2013 whose quality is far better than any LJ Speech Tacotron samples I have heard, even though the Blizzard 2013 data used there is smaller than LJ Speech. Why might this be?
Any comments appreciated.