Your intuition is right!
You use large values for reduction_factor at the start of training, because the missing data forces the model to rely on the attention alignments. You can also think of it as a kind of dropout for auto-regressive models. Later you start lowering reduction_factor, which improves the detail of the predicted mel spectrogram, because the model has more information to work with. Here is an explanation using a Tacotron2 model.
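To make the schedule idea concrete, here is a minimal sketch of a piecewise-constant reduction-factor schedule. The (start_step, r) pairs and the get_reduction_factor helper are illustrative only, not the repository's actual API:

```python
# Minimal sketch: start with a large reduction factor (harder task, coarser mels)
# and lower it as training progresses.
reduction_factor_schedule = [(0, 10), (80_000, 5), (150_000, 1)]  # illustrative values

def get_reduction_factor(step, schedule):
    """Return the reduction factor active at the given training step."""
    r = schedule[0][1]
    for start_step, value in schedule:
        if step >= start_step:
            r = value
    return r

assert get_reduction_factor(0, reduction_factor_schedule) == 10       # early: predict 10 frames per step
assert get_reduction_factor(100_000, reduction_factor_schedule) == 5
assert get_reduction_factor(200_000, reduction_factor_schedule) == 1  # late: full detail
```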
What will happen when self.max_r and self.r are not the same?
You need your model layers to have a static shape, so you initialize the projection output with the largest value in reduction_factor_schedule:
https://github.com/as-ideas/TransformerTTS/blob/e4ded5bf5a488aab98ce6aee981e3ac0946f4ddc/model/models.py#L83
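A rough sketch of that idea, with assumed names and shapes (mel_channels, max_r, final_proj_mel): the projection is sized for max_r once, so its weights never change shape during training.

```python
import tensorflow as tf

# Sketch only: the projection always predicts max_r mel frames per decoder step,
# so its weight shape stays fixed for the whole training run.
mel_channels = 80
max_r = 10  # the largest value in reduction_factor_schedule

final_proj_mel = tf.keras.layers.Dense(mel_channels * max_r)

decoder_out = tf.random.normal([2, 50, 256])  # (batch, decoder_steps, hidden)
mel_flat = final_proj_mel(decoder_out)        # (batch, decoder_steps, mel_channels * max_r)
print(mel_flat.shape)                         # (2, 50, 800)
```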
When you reduce the value of self.r during training, the layer keeps the same size, but you select just a part of its output:
https://github.com/as-ideas/TransformerTTS/blob/e4ded5bf5a488aab98ce6aee981e3ac0946f4ddc/model/models.py#L148-L151
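For illustration, a self-contained sketch of that selection (again with assumed names and a frame-major layout of the flat output): the layer still emits max_r frames per step, but only the first r are kept and unfolded along the time axis.

```python
import tensorflow as tf

# Sketch: keep only the first r of the max_r predicted frames per decoder step,
# then unfold the kept frames along the time axis.
mel_channels, max_r, r = 80, 10, 5            # r is the current self.r, with r <= max_r

final_proj_mel = tf.keras.layers.Dense(mel_channels * max_r)
decoder_out = tf.random.normal([2, 50, 256])  # (batch, decoder_steps, hidden)
mel_flat = final_proj_mel(decoder_out)        # (2, 50, 800)

# Keep the first r * mel_channels outputs, i.e. the first r frames under a
# frame-major layout, then reshape steps*r kept frames into the time axis.
out_proj = mel_flat[:, :, :r * mel_channels]                  # (2, 50, 400)
batch, steps = out_proj.shape[0], out_proj.shape[1]
mel = tf.reshape(out_proj, [batch, steps * r, mel_channels])  # (2, 250, 80)
print(mel.shape)
```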
Thanks for your elaboration.
Thank you @myagues, excellent explanation.
Hi, thanks for sharing this great work. I want to ask about a training trick: why do we need to use a dynamic input length in the decoder module? The relevant variables self.max_r and self.r can be found in models.py. The purpose seems to be to make training harder at the beginning, since we only use part of the data to predict the whole mel sequence, and easier later, when reduction_factor_schedule moves to smaller values, which means a larger input length. It looks a bit like a simulated annealing algorithm; does it really work as I described? What will happen when self.max_r and self.r are not the same?