Hi, we've been experimenting with TTS in two setups:
1) Tacotron 2 for mel spectrograms + vocoder (4-speaker)
2) Tacotron 2 for linear spectrograms directly (without post-net) + Griffin-Lim (4-speaker)
Setup 1 sounds better than setup 2, but it's still not on par with human ground-truth speech; there's a very slight vibrating noise. (I've uploaded samples below, please check them out.)
Any ideas on how to improve these further (post-processing or otherwise)? We're fairly confident each component of each setup is sufficiently trained, and our data is about 25 hours in total.
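For context, the Griffin-Lim side of setup 2 is roughly along these lines (a minimal sketch using librosa; the file name, FFT size, hop length, sharpening exponent, and sample rate here are placeholders, not necessarily our exact pipeline):

```python
import numpy as np
import librosa
import soundfile as sf

# Predicted linear-magnitude spectrogram from Tacotron 2,
# shape (1 + n_fft // 2, frames). File name is a placeholder.
lin_mag = np.load("predicted_linear_spectrogram.npy")

n_fft = 2048       # placeholder STFT size
hop_length = 256   # placeholder hop length

# Raising the magnitude to a power > 1 before inversion tends to
# suppress Griffin-Lim's metallic/vibrating artifacts; 1.2-1.5 is common.
sharpened = lin_mag ** 1.3

# More iterations than the default 32 usually give a cleaner phase estimate.
wav = librosa.griffinlim(
    sharpened,
    n_iter=60,
    hop_length=hop_length,
    win_length=n_fft,
)

sf.write("griffinlim_output.wav", wav, 22050)  # placeholder sample rate
```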
Thanks in advance! samples.zip