Hi, we've been experimenting with TTS in two setups:
1) Tacotron 2 for mel spectrograms + vocoder (4-speaker)
2) Tacotron 2 for linear spectrograms directly (without post-net) + Griffin-Lim (4-speaker)
Setup 1 sounds better than setup 2, but it's still not on par with human ground-truth speech; there's a very slight vibrating noise. (I've uploaded samples below, please check them out.)
Any ideas on how to improve these further (post-processing or otherwise)? We're fairly confident each component of each setup is sufficiently trained, and our data is about 25 hours in total.
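For context, the Griffin-Lim side of setup 2 is roughly along these lines (a minimal sketch using librosa; the file name, FFT size, hop length, sharpening exponent, and sample rate here are placeholders, not necessarily our exact pipeline):

```python
import numpy as np
import librosa
import soundfile as sf

# Predicted linear-magnitude spectrogram from Tacotron 2,
# shape (1 + n_fft // 2, frames). File name is a placeholder.
lin_mag = np.load("predicted_linear_spectrogram.npy")

n_fft = 2048       # placeholder STFT size
hop_length = 256   # placeholder hop length

# Raising the magnitude to a power > 1 before inversion tends to
# suppress Griffin-Lim's metallic/vibrating artifacts; 1.2-1.5 is common.
sharpened = lin_mag ** 1.3

# More iterations than the default 32 usually give a cleaner phase estimate.
wav = librosa.griffinlim(
    sharpened,
    n_iter=60,
    hop_length=hop_length,
    win_length=n_fft,
)

sf.write("griffinlim_output.wav", wav, 22050)  # placeholder sample rate
```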
Thanks in advance! samples.zip