TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Suggestions for datasets other than LJSpeech #140

Closed adfost closed 4 years ago

adfost commented 4 years ago

I was able to pinpoint the problem as the dataset, as I could train a model that drifted away from the pretrained model back toward LJSpeech. I am under the impression that my dataset, with fewer than 2000 files in the training set, is too small. Are there any other similar datasets you would suggest, and if so, any suggestions on formatting? Thanks again for your help.

machineko commented 4 years ago

Use a pretrained model as the starting point for your own model.

adfost commented 4 years ago

No, I understand that, but I'm more trying to find a suitable dataset to use, and a size requirement.

dathudeptrai commented 4 years ago

@adfost 2000 files is very small; your model will overfit very quickly. The dataset should be at least 5-10 hours of audio for a single speaker.
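Before training, it's worth checking how many hours a corpus actually contains against the 5-10 hour guideline above. A minimal sketch using only the standard library, assuming 16-bit PCM WAV files; the directory path and the 5-hour warning threshold are illustrative, not part of the repo:

```python
import wave
from pathlib import Path

def total_hours(wav_dir):
    """Sum the duration of every .wav file under wav_dir, in hours."""
    seconds = 0.0
    for path in Path(wav_dir).rglob("*.wav"):
        with wave.open(str(path), "rb") as wf:
            # frames / sample-rate gives this file's duration in seconds
            seconds += wf.getnframes() / wf.getframerate()
    return seconds / 3600.0

if __name__ == "__main__":
    hours = total_hours("datasets/my_speaker/wavs")  # hypothetical path
    print(f"{hours:.2f} h of audio")
    if hours < 5.0:
        print("Warning: fewer than 5 hours; the model may overfit quickly.")
```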

tekinek commented 4 years ago

@dathudeptrai I have a related question. My 25-hour dataset is single speaker, but I realize that almost half the utterances have obvious background noise, and their tone and room acoustics feel somewhat different (perhaps the two recordings were made at quite different times with different software and hardware settings).

1) Could this be an issue when training a single-speaker model? 2) Is it worth trying to train a multi-speaker model by marking these two parts of the dataset as different speakers? (I mean, given my limited GPU resources and Tacotron's long training time, trying a new setting is quite time consuming.)

Thanks.

machineko commented 4 years ago

@tekinek I had the same problem with my dataset, though with far fewer hours. Training it as multi-speaker on Tacotron 2 (not in this repo) works; single speaker also works fine, but then I got a lot of artifacts in the outputs.

But the difference in my dataset was that one part had very emotional reading, and I had almost zero background noise. (I would remove every file with background noise from your dataset; if you don't want to lose time on training, use a pretrained model, and 2-3 hours of data will be enough then.)
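One way to try the two-speaker idea is to relabel the two recording sessions as separate speaker IDs in an LJSpeech-style `metadata.csv` (`id|text|normalized text`). This is a sketch, not the repo's actual multi-speaker format: the output layout `id|speaker|text` and the way sessions are identified are assumptions you would adapt to your preprocessing pipeline:

```python
import csv

def split_by_session(in_path, out_path, session_a_ids):
    """Rewrite LJSpeech-style metadata rows as id|speaker|text, assigning
    speaker 0 to utterances listed in session_a_ids and speaker 1 to the rest."""
    with open(in_path, newline="", encoding="utf-8") as fin, \
         open(out_path, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin, delimiter="|")
        writer = csv.writer(fout, delimiter="|")
        for utt_id, text, *_rest in reader:
            speaker = 0 if utt_id in session_a_ids else 1
            writer.writerow([utt_id, speaker, text])
```

Session membership could come from file-name prefixes or recording dates; anything that separates the two acoustic conditions cleanly should work.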

tekinek commented 4 years ago

@machineko Thank you. I'll give it a try: train with the full dataset first, then fine-tune on the subset that has less noise.
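For selecting the less-noisy subset, one cheap heuristic is to measure the RMS level of each file's first 100 ms, which should be near silent in a clean studio recording. A minimal sketch assuming mono 16-bit PCM WAVs; the window length and threshold are guesses you would tune on a few known-clean files, and a proper SNR estimate would be more robust:

```python
import array
import math
import wave
from pathlib import Path

def leading_rms(path, window_s=0.1):
    """RMS amplitude of the first window_s seconds of a mono 16-bit WAV."""
    with wave.open(str(path), "rb") as wf:
        n = int(wf.getframerate() * window_s)
        samples = array.array("h", wf.readframes(n))
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def clean_subset(wav_dir, threshold=100.0):
    """Return the .wav paths whose leading window is quieter than threshold."""
    return [p for p in Path(wav_dir).rglob("*.wav")
            if leading_rms(p) < threshold]
```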