Mainly as a coding exercise and for learning, I'm currently implementing the aligner model in PyTorch.
The predicted mels improve during training; however, I haven't gotten any clear diagonal alignments yet, and consequently no proper results at inference either. I have not yet implemented decreasing the reduction factor over training (I keep r=10 all the way), nor forcing the alignments at given steps - could this be the reason?
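For reference, this is roughly the kind of reduction-factor schedule I have in mind (a minimal sketch; the names and step boundaries are my own, not taken from this repo):

```python
# Hypothetical step-based schedule for the reduction factor r:
# start coarse (many mel frames per decoder step), end fine-grained.
REDUCTION_SCHEDULE = [
    (0,       10),
    (80_000,   5),
    (150_000,  3),
    (250_000,  1),
]

def reduction_factor(step: int) -> int:
    """Return the reduction factor for the given training step."""
    r = REDUCTION_SCHEDULE[0][1]
    for boundary, value in REDUCTION_SCHEDULE:
        if step >= boundary:
            r = value
    return r
```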
I attached the attention weights from the last decoder attention layer at step 89000 - with some imagination you can see a hint of diagonality, but it doesn't get any better when training further (up to ~200k steps). I train on LJSpeech with the exact configs from this repo.
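By "forcing the alignments" I mean something like a guided attention loss that penalizes attention mass far from the diagonal (in the spirit of DC-TTS). A minimal sketch of what I'd try, assuming attention weights of shape (batch, decoder_steps, encoder_steps); all names here are mine:

```python
import torch

def guided_attention_loss(attn: torch.Tensor, g: float = 0.2) -> torch.Tensor:
    """Penalize attention mass far from the diagonal.

    attn: (batch, decoder_steps, encoder_steps) attention weights.
    g: width of the diagonal band; smaller g enforces a sharper diagonal.
    """
    b, t, n = attn.shape
    dec = torch.arange(t, device=attn.device).float().unsqueeze(1) / max(t - 1, 1)
    enc = torch.arange(n, device=attn.device).float().unsqueeze(0) / max(n - 1, 1)
    # Weight is ~0 on the normalized diagonal and grows toward 1 off it.
    w = 1.0 - torch.exp(-((enc - dec) ** 2) / (2.0 * g ** 2))
    return (attn * w.unsqueeze(0)).mean()
```

This would be added to the mel reconstruction loss with a small weight, possibly only for the first N steps.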