MelNet & other dataset than ljspeech

TensorSpeech / TensorFlowTTS

TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)

Apache License 2.0

Hello, thank you for this excellent and very intuitive implementation!

I am not familiar with TTS research, so my questions can probably be quite naive. :)

1) Do you also plan to implement MelNet? The audio results provided in the paper overview are quite impressive.
2) Is there any chance that in the long run you will train models on a dataset other than LJSpeech? This implementation is great, but the commercial applications (Google, Microsoft) have models trained on much better datasets.

P.S.: As I said, I am not familiar with TTS research, but there are plenty of excellent readers on LibriVox, all of them in the public domain. I have done plenty of audio-text matching tasks with the aeneas library in the past. Would it be enough to emulate the structure of the LJSpeech dataset? With the quantity and length of samples on LibriVox, I could probably create a dataset with 30-50 hours of single-speaker samples...

Once again thanks for your great work!

Hi,

> Do you also plan to implement MelNet? The audio results provided in the paper overview are quite impressive.

I will read the paper and consider implementing it later.

> Is there any chance that in the long run you will train models on a dataset other than LJSpeech? This implementation is great, but the commercial applications (Google, Microsoft) have models trained on much better datasets.

Actually, I want to try other datasets in the future, but for now I'm focusing only on LJSpeech for fast experiments :D.

> As I said, I am not familiar with TTS research, but there are plenty of excellent readers on LibriVox, all of them in the public domain. I have done plenty of audio-text matching tasks with the aeneas library in the past. Would it be enough to emulate the structure of the LJSpeech dataset? With the quantity and length of samples on LibriVox, I could probably create a dataset with 30-50 hours of single-speaker samples.

Yeah, it's enough :D. Maybe you can refer to the LibriSpeech-derived dataset for TTS (LibriTTS) here: http://www.openslr.org/60/
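
For reference, the LJSpeech layout is a `wavs/` folder of clips plus a pipe-delimited `metadata.csv` with one row per clip (`id|transcript|normalized transcript`). Below is a minimal sketch of writing that file, assuming the clips have already been cut and aligned (for example with aeneas). The function name `write_ljspeech_metadata` and the `segments` input are hypothetical, chosen for illustration; they are not part of TensorFlowTTS or aeneas.

```python
from pathlib import Path


def _clean(text):
    # The pipe is the field separator, so it must not appear inside a field;
    # runs of whitespace (including newlines) are collapsed to single spaces.
    return " ".join(text.replace("|", " ").split())


def write_ljspeech_metadata(segments, out_dir):
    """Write an LJSpeech-style metadata.csv under out_dir.

    segments: iterable of (clip_id, raw_text, normalized_text) tuples
    (hypothetical input, e.g. the result of a forced-alignment step).
    The audio for each row is expected at out_dir/wavs/<clip_id>.wav;
    this sketch only creates the folder and does not write audio.
    """
    out_dir = Path(out_dir)
    (out_dir / "wavs").mkdir(parents=True, exist_ok=True)
    metadata_path = out_dir / "metadata.csv"
    with metadata_path.open("w", encoding="utf-8", newline="") as f:
        for clip_id, raw_text, normalized_text in segments:
            f.write(f"{clip_id}|{_clean(raw_text)}|{_clean(normalized_text)}\n")
    return metadata_path
```

Each clip would then be saved as `wavs/<clip_id>.wav`. LJSpeech itself uses short single-speaker clips at 22050 Hz, so resampling LibriVox audio to match is probably the simplest way to reuse existing configs.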
