Your intuition is right!
You use large values for reduction_factor at the start of training, because the missing data forces the model to rely on the attention alignments. You can also think of it as a kind of dropout for auto-regressive models. Later you start lowering reduction_factor, which improves the detail of the predicted mel spectrogram, because the model has more information to work with. Here is an explanation using a Tacotron2 model.
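To make the schedule idea concrete, here is a minimal sketch of a piecewise-constant reduction-factor schedule. The (start_step, r) pairs and the get_reduction_factor helper are illustrative only, not the repository's actual API:

```python
# Minimal sketch: start with a large reduction factor (harder task, coarser mels)
# and lower it as training progresses.
reduction_factor_schedule = [(0, 10), (80_000, 5), (150_000, 1)]  # illustrative values

def get_reduction_factor(step, schedule):
    """Return the reduction factor active at the given training step."""
    r = schedule[0][1]
    for start_step, value in schedule:
        if step >= start_step:
            r = value
    return r

assert get_reduction_factor(0, reduction_factor_schedule) == 10       # early: predict 10 frames per step
assert get_reduction_factor(100_000, reduction_factor_schedule) == 5
assert get_reduction_factor(200_000, reduction_factor_schedule) == 1  # late: full detail
```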
What will happen when self.max_r and self.r are not the same?
You need your model layers to have a static shape, so you initialize the projection output with the largest value in reduction_factor_schedule:
https://github.com/as-ideas/TransformerTTS/blob/e4ded5bf5a488aab98ce6aee981e3ac0946f4ddc/model/models.py#L83
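A rough sketch of that idea, with assumed names and shapes (mel_channels, max_r, final_proj_mel): the projection is sized for max_r once, so its weights never change shape during training.

```python
import tensorflow as tf

# Sketch only: the projection always predicts max_r mel frames per decoder step,
# so its weight shape stays fixed for the whole training run.
mel_channels = 80
max_r = 10  # the largest value in reduction_factor_schedule

final_proj_mel = tf.keras.layers.Dense(mel_channels * max_r)

decoder_out = tf.random.normal([2, 50, 256])  # (batch, decoder_steps, hidden)
mel_flat = final_proj_mel(decoder_out)        # (batch, decoder_steps, mel_channels * max_r)
print(mel_flat.shape)                         # (2, 50, 800)
```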
When you reduce the value of self.r during training, the layer keeps the same size, but you select just a part of its output:
https://github.com/as-ideas/TransformerTTS/blob/e4ded5bf5a488aab98ce6aee981e3ac0946f4ddc/model/models.py#L148-L151
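For illustration, a self-contained sketch of that selection (again with assumed names and a frame-major layout of the flat output): the layer still emits max_r frames per step, but only the first r are kept and unfolded along the time axis.

```python
import tensorflow as tf

# Sketch: keep only the first r of the max_r predicted frames per decoder step,
# then unfold the kept frames along the time axis.
mel_channels, max_r, r = 80, 10, 5            # r is the current self.r, with r <= max_r

final_proj_mel = tf.keras.layers.Dense(mel_channels * max_r)
decoder_out = tf.random.normal([2, 50, 256])  # (batch, decoder_steps, hidden)
mel_flat = final_proj_mel(decoder_out)        # (2, 50, 800)

# Keep the first r * mel_channels outputs, i.e. the first r frames under a
# frame-major layout, then reshape steps*r kept frames into the time axis.
out_proj = mel_flat[:, :, :r * mel_channels]                  # (2, 50, 400)
batch, steps = out_proj.shape[0], out_proj.shape[1]
mel = tf.reshape(out_proj, [batch, steps * r, mel_channels])  # (2, 250, 80)
print(mel.shape)
```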
Thanks for your elaboration.
Thank you @myagues, excellent explanation.
Hi, thanks for sharing this great work. I want to ask about a training trick: why do we need to use a dynamic input length in the decoder module? The relevant variables self.max_r and self.r can be found in models.py. The purpose seems to be to make training harder at the beginning, since we only use part of the data to predict the whole mel sequence, and easier later, when reduction_factor_schedule moves to smaller values, which means a larger input length. It looks a bit like a simulated annealing algorithm; does it really work as I described? What will happen when self.max_r and self.r are not the same?