Closed: NoamDev closed this issue 4 years ago
What problem are you experiencing?
It's just that it doesn't sound as good as in Google's sample. Google trained their model on 22 hours of recordings, but the README here says the linked examples were generated from a model trained on only 16 hours. Could more training data improve the model? Has anyone tried such a thing?
24.6 hours for Tacotron 1/2.
The voices you hear from Google Assistant are also run through a neural vocoder (Parallel WaveNet or similar). They've updated their TTS engine with other adjustments as well; unfortunately, they don't release code, so the papers they publish have to be reverse-engineered.
To answer your question: there are possibly some small improvements to be had with more data. I suspect a better option would be to move to a Tacotron2-based TTS engine with a neural vocoder.
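For context, a Tacotron2-style engine is a two-stage pipeline: an acoustic model maps text to a mel spectrogram, and a separate vocoder maps that spectrogram to a waveform. Here's a minimal sketch of that data flow; the stub functions, frame count heuristic, and shapes are illustrative assumptions, not the actual Tacotron2/WaveNet code:

```python
import numpy as np

N_MELS = 80       # mel channels (a typical Tacotron2 setting)
HOP_LENGTH = 256  # audio samples per spectrogram frame (illustrative)

def acoustic_model(text: str) -> np.ndarray:
    """Stand-in for Tacotron2: text -> mel spectrogram (frames x n_mels).
    A real model predicts frames autoregressively with attention."""
    n_frames = 5 * len(text)             # rough frames-per-character guess
    return np.zeros((n_frames, N_MELS))  # placeholder spectrogram

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stand-in for a neural vocoder (e.g. WaveNet): mel -> waveform.
    Each frame is upsampled to HOP_LENGTH audio samples."""
    n_frames = mel.shape[0]
    return np.zeros(n_frames * HOP_LENGTH)  # placeholder audio

def tts(text: str) -> np.ndarray:
    mel = acoustic_model(text)  # stage 1: text -> mel spectrogram
    return vocoder(mel)         # stage 2: mel spectrogram -> audio

audio = tts("Hello world")
print(audio.shape)
```

The point of the split is that the vocoder replaces a hand-crafted reconstruction step (like Griffin-Lim) with a learned model, which is where most of the perceived quality gap comes from.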
I understand, thanks!
If you're super interested, they have an archive of their papers: https://google.github.io/tacotron/
Is it possible the main problem is lack of data? Google trained their example model on 22 hours, but the Mimic samples came from a model trained on only 16 hours.