as-ideas / TransformerTTS

🤖💬 Transformer TTS: Implementation of a non-autoregressive Transformer based neural network for text to speech.
https://as-ideas.github.io/TransformerTTS/

Average training time in Google Colab with GPU #50

Open giymen opened 4 years ago

giymen commented 4 years ago

I am working on Colab, and for now I'm trying to train the model with the LJSpeech dataset (just for a trial; later I will use custom data).

I used the parameters from the config files, with `max_steps: 900_000` for melgan/autoregressive. This is my first TTS model, so I wanted to ask about training time: how many hours are expected for the total training time of the models?

cfrancesco commented 4 years ago

Hi, I trained the autoregressive models for about 600K steps (some a bit less) and around the same for the forward models. If I remember correctly, this should take about 2-3 days (on an RTX 2080).

tylerweitzman commented 3 years ago

I'm getting 1.7 s/it on TTS training and 5.9 s/it on aligner training on a Tesla P100 (16 GB) on Colab.
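As a rough sanity check, seconds per iteration times the step count gives a lower bound on wall-clock time (validation, checkpointing and prediction plots add overhead). A small sketch using the rates above and the 600K steps quoted earlier in this thread; the step counts are just the figures mentioned here, not a recommendation:

```python
# Rough wall-clock estimate from reported seconds-per-iteration.
# Ignores validation, checkpointing and plotting overhead.
def estimate_days(seconds_per_it: float, steps: int) -> float:
    return seconds_per_it * steps / 3600 / 24

# ~1.7 s/it on a Colab P100, 600K steps as quoted above -> roughly 12 days
print(f"TTS:     {estimate_days(1.7, 600_000):.1f} days")
# Same arithmetic for the aligner rate; its actual max_steps is configured separately
print(f"Aligner: {estimate_days(5.9, 600_000):.1f} days")
```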

I'm trying to figure out how batch size plays into this, for example if I had more GPU memory. The config file only has `bucket_batch_sizes` but no `batch_size`, so I'm not sure what batch size this is running with. I think `bucket_batch_sizes` is only for the aligner?

Also, it looks like my default config is different from yours, @giymen. https://github.com/as-ideas/TransformerTTS/blob/main/config/training_config.yaml shows a max step count of 260,000, for example, not 900,000 and not 600,000, so there may also be other changes (say, model dimensions) that affect the number of parameters and therefore the training time. @cfrancesco, could you explain the batch size and the discrepancy in the default training configuration? Thanks!

I seem to have three options for default training configs:

1. The current one linked above in the master branch.
2. The one in the Colab demo, commit c3405c53e435a06c809533aa4453923469081147.
3. The one used by `from model.factory import tts_ljspeech`, which has 260K max steps and is linked from https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/TransformerTTS/api_weights/ljspeech_tts_config_v1.yaml
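One way to see what actually differs between these files is to load them and diff the top-level keys. A minimal sketch, assuming the configs are plain YAML files that have been downloaded locally (the paths below are placeholders):

```python
# Diff two TransformerTTS training configs by top-level key.
# Paths are placeholders: e.g. the repo's config/training_config.yaml
# and a downloaded copy of ljspeech_tts_config_v1.yaml from the S3 link above.
import yaml

def load_config(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

repo_cfg = load_config("config/training_config.yaml")
api_cfg = load_config("ljspeech_tts_config_v1.yaml")

for key in sorted(set(repo_cfg) | set(api_cfg)):
    a, b = repo_cfg.get(key), api_cfg.get(key)
    if a != b:
        print(f"{key}: {a!r} -> {b!r}")
```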

cfrancesco commented 3 years ago

Hi, batch sizes are dynamic. Samples are bucketed by duration, so the batch size depends on how many samples fall into each bin; the maximum size for each interval is specified in `bucket_batch_sizes`. The max steps have been reduced over time because training has become more efficient (for example, with the addition of the diagonality loss).
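For anyone who wants to see the mechanism concretely: `tf.data` ships a bucketing transform where a length function assigns each sample to a bucket and each bucket has its own maximum batch size, so the memory footprint per batch stays roughly constant. A minimal sketch of that idea with toy data, not the repository's actual pipeline; the boundaries and batch sizes below are made up:

```python
import tensorflow as tf

# Toy variable-length "mel" sequences of shape (length, n_mels).
lengths = [120, 340, 610, 880, 150, 400, 700]
dataset = tf.data.Dataset.from_generator(
    lambda: (tf.zeros([n, 80]) for n in lengths),
    output_signature=tf.TensorSpec(shape=[None, 80], dtype=tf.float32),
)

# Buckets by sequence length: short samples get large batches, long samples
# small ones, so memory per batch stays roughly constant.
bucket_boundaries = [200, 500, 800]   # 4 buckets: <200, 200-500, 500-800, >=800
bucket_batch_sizes = [32, 16, 8, 4]   # one maximum batch size per bucket

batched = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda mel: tf.shape(mel)[0],
        bucket_boundaries=bucket_boundaries,
        bucket_batch_sizes=bucket_batch_sizes,
    )
)

for batch in batched:
    print(batch.shape)  # batch size depends on which bucket the samples landed in
```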