Closed jesinj closed 3 years ago
@jesinj 200k is just the maximum number of steps. If you are only training the model to extract durations for FastSpeech2, 40k steps is enough; if you want to use it for real inference, around 80k-100k is enough. The training speed on LJSpeech on a 2080Ti is 3.3 s/iteration without mixed precision.
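As a rough sanity check on those numbers (assuming the quoted 3.3 s/iteration holds throughout and one iteration equals one step), the wall-clock times work out to:

```python
# Back-of-envelope training-time estimate from the figures quoted
# above: 3.3 s per step on a 2080Ti, no mixed precision.
sec_per_step = 3.3

for steps in (40_000, 100_000, 200_000):
    hours = steps * sec_per_step / 3600
    print(f"{steps:>7} steps ~ {hours:.1f} h ~ {hours / 24:.1f} days")
```

So 100k steps comes out to roughly 91.7 hours (about 3.8 days), and the full 200k to about 7.6 days, which matches the "about a week" estimate below.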
@jesinj
Are we looking at at least a week's training from scratch, then?
Yeah, pretty much, although I think the model is done training at about 120k. You can alleviate it by going to the config and turning on use_fixed_shapes, which yielded an almost 2x increase in speed for me.
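In the YAML-style training configs used here, that is typically a single flag; the surrounding key and value below are illustrative, not copied from the repo:

```yaml
# Illustrative fragment of a tacotron2 training config.
# Fixing shapes pads every batch to the same maximum length, so the
# GPU kernels are not recompiled for each new batch shape, which is
# where the roughly 2x speedup comes from.
use_fixed_shapes: true
batch_size: 32   # illustrative value
```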
If you're looking for something feasible to train from scratch on Colab, try FastSpeech2.
@dathudeptrai thanks a lot for your reply. Yes, it would be preferable to be able to use Colab at the moment. I will have a look into it and keep you updated with my progress.
Finally, my aim is to use the trained model and adapt it for another variant of English. Will FastSpeech2 be suitable for this? @ZDisket @dathudeptrai
@jesinj What do you mean by "variant of English"? If you have enough data and a forced aligner can accurately predict durations then FastSpeech2 is worth looking into.
I am working on New Zealand English. My database is small: a single speaker, 1000 sentences, studio-quality recordings. And I think the forced aligner will be able to align the data well.
So, I think the database may not be sufficient to train a model from scratch. Hence, I am trying to use a pre-trained model and then adapt it to my data.
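A minimal Keras-style sketch of that adaptation idea (the toy model, file path, and hyperparameters are placeholders, not the TensorFlowTTS API): initialise a new model from pretrained weights, then continue training on the small target-accent dataset with a reduced learning rate.

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Toy stand-in for an acoustic model; the real Tacotron2 /
    # FastSpeech2 architectures come from the library, not this sketch.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(80,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(80),
    ])

# "Pretrained" model (in practice: a checkpoint trained on LJSpeech).
pretrained = build_model()
pretrained.save_weights("/tmp/pretrained.weights.h5")

# New model initialised from those weights, then fine-tuned on the
# small target dataset with a low learning rate to avoid destroying
# what the pretrained model has learned.
finetune = build_model()
finetune.load_weights("/tmp/pretrained.weights.h5")
finetune.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="mae")

x = np.random.rand(8, 80).astype("float32")  # stand-in input features
y = np.random.rand(8, 80).astype("float32")  # stand-in targets
finetune.fit(x, y, epochs=1, verbose=0)
```

With only ~1000 sentences, keeping most of the pretrained weights and training briefly at a small learning rate is usually safer than training from scratch.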
@jesinj Why retrain your own model from scratch? Is there something preventing you from using one that's already here? There's an MFA-aligned FastSpeech2 model trained on LJSpeech. The easiest way to test it is by checking out the C++ inference demo.
We wanted to familiarize ourselves with the training process, so that we can train a model for a non-English language we are working on as well. I think it may be better to adapt a pre-trained model then, at least for this English variant. I will have a look. Thanks again.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Hello
We have been trying to train the Tacotron2-based speech synthesis model from scratch using the LJSpeech database. Can anyone comment on how long it has taken them to train this using Colab? Reaching approximately 2000 steps has taken 12 hours. In total there are 200k steps for the training, right? Are we looking at at least a week's training from scratch, then? That would not be possible even on the Google Colab Pro version. If anyone has tried this, please share your experience.
Thanks