What paper are you looking at? Here they use mel linear for stop prob. https://arxiv.org/pdf/1809.08895.pdf
3.7 Mel Linear, Stop Linear and Post-net: Same as Tacotron2, we use two different linear projections to predict the mel spectrogram and the stop token respectively, and use a 5-layer CNN to produce a residual to refine the reconstruction of mel spectrogram.
From the model architecture in the paper (Fig. 3) you will see that the decoder output is projected to the mel linear, and the stop linear also takes the decoder output, not the output of the mel linear.
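For reference, a minimal sketch of the wiring described in the paper, with placeholder names in a PyTorch-style module (this is not the repo's actual code):

```python
import torch
import torch.nn as nn

class PaperStyleHead(nn.Module):
    """Sketch of the head from the paper (Fig. 3): both the mel projection
    and the stop projection read the decoder output directly.
    Names and sizes are illustrative."""

    def __init__(self, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.mel_linear = nn.Linear(d_model, n_mels)  # decoder output -> mel frame
        self.stop_linear = nn.Linear(d_model, 1)      # decoder output -> stop logit

    def forward(self, decoder_out: torch.Tensor):
        # decoder_out: (batch, time, d_model)
        mel = self.mel_linear(decoder_out)            # (batch, time, n_mels)
        stop_logit = self.stop_linear(decoder_out)    # (batch, time, 1)
        return mel, stop_logit
```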
Ah yes, you're right. The main difference here is that I use a time-wise reduction factor during training, so an additional linear layer is needed to project the decoder output back to the original length, and I use that directly as the linear mel. Replicating the paper 100% would not be possible in this case, but one could add an additional linear layer after the projection to predict the linear mel and use the projection, as it is now, to predict the stop prob. I'm not sure this would make a difference, but it is for sure a discrepancy. Thanks for pointing it out!
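As a rough sketch of what that variant could look like (a PyTorch-style module; names like `expand`, `r`, etc. are hypothetical and not the repo's actual code):

```python
import torch
import torch.nn as nn

class ReductionHead(nn.Module):
    """Sketch of the variant discussed above: with a time-wise reduction
    factor r, one projection expands each decoder step back to r mel frames;
    an extra linear then predicts the linear mel, while the expanded
    projection itself feeds the stop prediction. All names are hypothetical."""

    def __init__(self, d_model: int = 256, n_mels: int = 80, r: int = 2):
        super().__init__()
        self.r, self.n_mels = r, n_mels
        # projection that undoes the reduction: one decoder step -> r frames
        self.expand = nn.Linear(d_model, n_mels * r)
        # the suggested extra layer, so mel and stop come from different tensors
        self.mel_linear = nn.Linear(n_mels, n_mels)
        self.stop_linear = nn.Linear(n_mels, 1)

    def forward(self, decoder_out: torch.Tensor):
        b, t, _ = decoder_out.shape
        expanded = self.expand(decoder_out).view(b, t * self.r, self.n_mels)
        mel = self.mel_linear(expanded)          # linear mel from the extra layer
        stop_logit = self.stop_linear(expanded)  # stop prob from the projection
        return mel, stop_logit
```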
Thanks ...
Hey, I am new to TTS and transformers. I looked into your code and found that you are using the mel linear to compute the stop prob, while in the paper they use the decoder output for predicting the stop prob. Can you explain this? Please correct me if I am wrong. Basically, I looked at the Postnet code and inferred it from there.