LTH14 / mage

A PyTorch implementation of MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

About the computation of logits after decoder #20

Open liuqk3 opened 1 year ago

liuqk3 commented 1 year ago

Hi @LTH14 ,

Great work! I have a question about the computation of the logits after the decoder. I see that an MlmLayer is used: the decoder output is mapped by an fc layer, then the dot product between the mapped features and the word embeddings is computed, and a bias is added to produce the logits.
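For reference, the pattern I'm describing looks roughly like this (a minimal sketch of a BERT-style MLM head; layer names and dimensions are my assumptions, not copied from the repo):

```python
import torch
import torch.nn as nn

class MlmLayer(nn.Module):
    """BERT-style MLM head: project decoder features into the word-embedding
    space, score them against the token embedding matrix, and add a bias."""
    def __init__(self, feat_dim, emb_dim, vocab_size):
        super().__init__()
        self.fc = nn.Linear(feat_dim, emb_dim)  # map decoder output to embedding space
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(emb_dim)
        self.bias = nn.Parameter(torch.zeros(1, 1, vocab_size))  # per-token output bias

    def forward(self, x, word_embeddings):
        # x: (B, N, feat_dim); word_embeddings: (vocab_size, emb_dim)
        h = self.norm(self.act(self.fc(x)))
        # dot product with every word embedding -> (B, N, vocab_size)
        logits = torch.matmul(h, word_embeddings.t())
        return logits + self.bias

# Example usage with made-up sizes:
head = MlmLayer(feat_dim=512, emb_dim=256, vocab_size=1024)
tokens = nn.Embedding(1024, 256)
logits = head(torch.randn(2, 196, 512), tokens.weight)  # (2, 196, 1024)
```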

Have you tried computing the logits directly with an fc layer on top of the decoder's output features? What is the main difference between these two types of logits, and which one do you think works better?

Thanks.

LTH14 commented 1 year ago

We follow BERT for this design. I haven't tried using logits directly from an fc layer.
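For comparison, the untied alternative asked about would be a single linear classifier over the vocabulary; a minimal sketch, with dimensions assumed:

```python
import torch
import torch.nn as nn

decoder_dim, vocab_size = 512, 1024  # assumed sizes for illustration
direct_head = nn.Linear(decoder_dim, vocab_size)

decoder_output = torch.randn(2, 196, decoder_dim)
logits = direct_head(decoder_output)  # (B, N, vocab_size)
```

The BERT-style head ties the output projection to the token embedding matrix, so the classifier reuses the embedding weights (plus a learned bias) rather than learning an independent vocab_size x dim matrix; the untied fc head adds those parameters but is otherwise a drop-in replacement.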