as-ideas / TransformerTTS

🤖💬 Transformer TTS: Implementation of a non-autoregressive Transformer based neural network for text to speech.
https://as-ideas.github.io/TransformerTTS/

Regarding Model architecture #59

Closed bkumardevan07 closed 4 years ago

bkumardevan07 commented 4 years ago

Hey, I am new to TTS and transformers. I looked into your code and found that you are using the mel linear output to compute the stop prob, while in the paper they use the decoder output to predict the stop prob. Can you explain? Please correct me if I am wrong. Basically, I was looking at the Postnet code and inferring from there.

cfrancesco commented 4 years ago

What paper are you looking at? Here they use the mel linear for the stop prob: https://arxiv.org/pdf/1809.08895.pdf

3.7 Mel Linear, Stop Linear and Post-net: Same as Tacotron2, we use two different linear projections to predict the mel spectrogram and the stop token respectively, and use a 5-layer CNN to produce a residual to refine the reconstruction of mel spectrogram.
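
For reference, a minimal sketch of the scheme that paper section describes, assuming TensorFlow/Keras and illustrative sizes (the 256-dim decoder output and 80 mel bins are assumptions, not the repo's actual values). Both heads branch off the decoder output, and the post-net adds a residual:

```python
import tensorflow as tf

n_mels = 80
decoder_out = tf.random.normal([1, 100, 256])   # (batch, frames, decoder_dim), illustrative

mel_linear = tf.keras.layers.Dense(n_mels)      # linear projection predicting the mel spectrogram
stop_linear = tf.keras.layers.Dense(1)          # linear projection predicting the stop token

mel_before = mel_linear(decoder_out)            # (1, 100, 80)
stop_prob = stop_linear(decoder_out)            # (1, 100, 1): taken from the decoder output

# 5-layer CNN post-net produces a residual that refines the mel reconstruction
postnet = tf.keras.Sequential(
    [tf.keras.layers.Conv1D(512, 5, padding="same", activation="tanh") for _ in range(4)]
    + [tf.keras.layers.Conv1D(n_mels, 5, padding="same")]
)
mel_after = mel_before + postnet(mel_before)    # (1, 100, 80)
```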

bkumardevan07 commented 4 years ago

From the model architecture in the paper (Fig. 3) you will see that the decoder output is projected to the mel linear, and the decoder output is also projected to the stop linear, not the output of the mel linear.

cfrancesco commented 4 years ago

Ah yes, you're right. The main difference here is that I use a time-wise reduction factor during training, so an additional linear layer is needed to project the decoder output back to the original length, and I use that projection directly as the mel linear. Replicating the paper 100% is not possible in this case, but one could add an additional linear layer after the projection to predict the mel and use the projection, as it is now, to predict the stop prob. I'm not sure this would make a difference, but it is for sure a discrepancy. Thanks for pointing it out! A sketch of the two wirings is below.
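
To make the discrepancy concrete, here is a hedged sketch of both wirings under the reduction factor, with assumed shapes (r=2, a 256-dim decoder, and 80 mel bins are illustrative, not the repo's actual values):

```python
import tensorflow as tf

n_mels, r = 80, 2                              # r: time-wise reduction factor (illustrative)
decoder_out = tf.random.normal([1, 50, 256])   # (batch, T/r, decoder_dim)

# Length-restoring projection: each decoder frame expands to r mel frames.
proj = tf.keras.layers.Dense(r * n_mels)(decoder_out)   # (1, 50, r * n_mels)
mel = tf.reshape(proj, [1, 50 * r, n_mels])             # (1, T, n_mels): used directly as the mel linear
stop_prob = tf.keras.layers.Dense(1)(mel)               # current scheme: stop prob read off the mel

# Variant closer to the paper: keep the projection for the stop prob and
# add one extra linear layer to predict the mel.
mel_alt = tf.keras.layers.Dense(n_mels)(mel)            # additional layer predicts the mel
stop_prob_alt = tf.keras.layers.Dense(1)(mel)           # projection, as it is now, predicts stop
```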

bkumardevan07 commented 4 years ago

Thanks ...