Closed jesinj closed 3 years ago
@jesinj 200k is just the maximum number of steps. If you are only training the model to extract durations for FastSpeech2, 40k steps is enough; if you want to use it for real inference, around 80k-100k is enough. The training speed on LJSpeech on a 2080Ti is 3.3 s/iteration without mixed precision.
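As a rough sanity check on those numbers (assuming the quoted 3.3 s/iteration holds throughout and one iteration equals one step), the wall-clock times work out to:

```python
# Back-of-envelope training-time estimate from the figures quoted
# above: 3.3 s per step on a 2080Ti, no mixed precision.
sec_per_step = 3.3

for steps in (40_000, 100_000, 200_000):
    hours = steps * sec_per_step / 3600
    print(f"{steps:>7} steps ~ {hours:.1f} h ~ {hours / 24:.1f} days")
```

So 100k steps comes out to roughly 91.7 hours (about 3.8 days), and the full 200k to about 7.6 days, which matches the "about a week" estimate below.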
@jesinj
Are we looking at at least a week's training from scratch, then?
Yeah, pretty much, although I think the model is done training at about 120k. You can alleviate it by going to the config and turning on use_fixed_shapes, which yielded an almost 2x increase in speed for me.
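In the YAML-style training configs used here, that is typically a single flag; the surrounding key and value below are illustrative, not copied from the repo:

```yaml
# Illustrative fragment of a tacotron2 training config.
# Fixing shapes pads every batch to the same maximum length, so the
# GPU kernels are not recompiled for each new batch shape, which is
# where the roughly 2x speedup comes from.
use_fixed_shapes: true
batch_size: 32   # illustrative value
```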
If you're looking for something feasible to train from scratch on Colab, try FastSpeech2.
@dathudeptrai thanks a lot for your reply. Yes, it would be preferable to be able to use Colab at the moment. I will have a look into it and keep you updated with my progress.
Finally, my aim is to use the trained model and adapt it for another variant of English. Will FastSpeech2 be suitable for this? @ZDisket @dathudeptrai
@jesinj What do you mean by "variant of English"? If you have enough data and a forced aligner can accurately predict durations then FastSpeech2 is worth looking into.
I am working on New Zealand English. My database is small: a single speaker, 1000 sentences, studio-quality recordings. And I think the forced aligner will be able to align the data well.
So, I think the database may not be sufficient to train a model from scratch. Hence, I am trying to use a pre-trained model and then adapt it to my data.
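A minimal Keras-style sketch of that adaptation idea (the toy model, file path, and hyperparameters are placeholders, not the TensorFlowTTS API): initialise a new model from pretrained weights, then continue training on the small target-accent dataset with a reduced learning rate.

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Toy stand-in for an acoustic model; the real Tacotron2 /
    # FastSpeech2 architectures come from the library, not this sketch.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(80,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(80),
    ])

# "Pretrained" model (in practice: a checkpoint trained on LJSpeech).
pretrained = build_model()
pretrained.save_weights("/tmp/pretrained.weights.h5")

# New model initialised from those weights, then fine-tuned on the
# small target dataset with a low learning rate to avoid destroying
# what the pretrained model has learned.
finetune = build_model()
finetune.load_weights("/tmp/pretrained.weights.h5")
finetune.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="mae")

x = np.random.rand(8, 80).astype("float32")  # stand-in input features
y = np.random.rand(8, 80).astype("float32")  # stand-in targets
finetune.fit(x, y, epochs=1, verbose=0)
```

With only ~1000 sentences, keeping most of the pretrained weights and training briefly at a small learning rate is usually safer than training from scratch.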
@jesinj Why retrain your own model from scratch? Is there something preventing you from using one that's already here? There's an MFA-aligned FastSpeech2 model trained on LJSpeech. The easiest way to test it is by checking out the C++ inference demo.
We wanted to familiarize ourselves with the training process, so that we can train a model for a non-English language we are working on as well. I think it may be better to adapt a pre-trained model then, at least for this English variant. I will have a look. Thanks again.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Hello
We have been trying to train the Tacotron2-based speech synthesis model from scratch using the LJSpeech database. Can anyone comment on how long it has taken them to train this using Colab? Reaching approximately 2000 steps has taken 12 hours. In total there are 200k steps for the training, right? Are we looking at at least a week's training from scratch, then? That would not be possible even on the Google Colab Pro version. If anyone has tried this, please share your experience.
Thanks