jiqiujia opened this issue 5 years ago (status: Open)
Hmmm, what do you mean by the output embedding? Do you mean the softmaxed output distribution?
The output embedding is the linear layer in MaskedLanguageModel.
I made a mistake: the output embedding is already shared across all token positions. It should be easy to tie the input embedding and the output embedding.
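For reference, a minimal sketch of what the tying could look like, assuming MaskedLanguageModel keeps its projection as `self.linear` and receives the model's token embedding (an `nn.Embedding`); the attribute names and constructor signature here are assumptions for illustration, not the repo's confirmed API:

```python
import torch.nn as nn

class MaskedLanguageModel(nn.Module):
    """MLM head whose output projection can reuse the input embedding matrix."""

    def __init__(self, hidden, vocab_size, token_embedding: nn.Embedding = None):
        super().__init__()
        self.linear = nn.Linear(hidden, vocab_size)
        if token_embedding is not None:
            # Tie: the output projection shares the same Parameter object as the
            # input token embedding, so both are stored once and updated together.
            # Shapes match (vocab_size x hidden) as long as hidden == embed_size.
            self.linear.weight = token_embedding.weight
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        return self.softmax(self.linear(x))
```

The assignment `self.linear.weight = token_embedding.weight` is the standard PyTorch weight-tying idiom: both modules then reference one shared `nn.Parameter`, and gradients from both paths accumulate into it.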
Is there any benefit if we tie the weights of the two layers? If so, could you point me to some references that use a similar architecture?
Here's a paper: https://arxiv.org/abs/1608.05859
With tying there is a lower memory requirement, and training should be faster (I believe).
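To make the memory point concrete, here is a small self-contained check (generic PyTorch, not the repo's code; the sizes are illustrative) showing that tying removes one vocab_size × hidden matrix from the parameter count:

```python
import torch.nn as nn

vocab_size, hidden = 30000, 768  # illustrative sizes, not the repo defaults

embedding = nn.Embedding(vocab_size, hidden)
decoder = nn.Linear(hidden, vocab_size)

def unique_param_count(*tensors):
    # Count each Parameter object only once, even if it is shared.
    return sum(p.numel() for p in {id(p): p for p in tensors}.values())

before = unique_param_count(embedding.weight, decoder.weight, decoder.bias)

# Tie the output projection to the input embedding
# (both weights have shape vocab_size x hidden).
decoder.weight = embedding.weight

after = unique_param_count(embedding.weight, decoder.weight, decoder.bias)
print(before - after)  # 23_040_000 parameters saved (= vocab_size * hidden)
```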
@jiqiujia @briandw Cool, I'll implement it in version 0.0.1a5, but it seems like solving #32 is a higher priority.
I think it's reasonable to tie the input and output embeddings, especially the output embedding that is applied at each token position. But I still can't figure out a way to do this. Can anyone give me an idea?