codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation
Apache License 2.0

Tie the input and output embedding? #40

Open jiqiujia opened 5 years ago

jiqiujia commented 5 years ago

I think it's reasonable to tie the input and output embeddings, especially the output embedding for each token. But I still can't find a way to do this. Anyone have an idea?

codertimo commented 5 years ago

Hmmm? What do you mean by the output embedding? Do you mean the softmaxed output distribution?

jiqiujia commented 5 years ago

The output embedding is the linear layer in MaskedLanguageModel. I made a mistake earlier: the output embedding is already shared across token positions. So it should be easy to tie the input embedding and the output embedding.
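
For reference, a minimal PyTorch sketch of the tying with generic modules (`TiedMaskedLM`, `token_embedding`, `decoder` are illustrative names, not this repo's actual classes): the embedding matrix and the MLM output projection point at the same `Parameter`, so the `(vocab_size, hidden)` block is stored and updated only once.

```python
import torch
import torch.nn as nn

class TiedMaskedLM(nn.Module):
    """Illustrative sketch of input/output embedding tying, not the repo's classes."""

    def __init__(self, vocab_size: int, hidden: int):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden)
        # Output projection from hidden states back to vocabulary logits.
        self.decoder = nn.Linear(hidden, vocab_size, bias=False)
        # Tie the weights: nn.Embedding.weight is (vocab_size, hidden) and
        # nn.Linear(hidden, vocab_size).weight is also (vocab_size, hidden),
        # so the same Parameter can back both modules.
        self.decoder.weight = self.token_embedding.weight

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) from the transformer encoder
        return torch.log_softmax(self.decoder(hidden_states), dim=-1)
```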

codertimo commented 5 years ago

Is there any benefit if we tie the two layers' weights? If so, could you give me some references that use a similar architecture?

briandw commented 5 years ago

Here's a paper: "Using the Output Embedding to Improve Language Models" (Press & Wolf): https://arxiv.org/abs/1608.05859

With tying there is a lower memory requirement, and training should be faster (I believe).
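
For a rough sense of the saving (assuming BERT-base-like sizes here, which are only an example and not necessarily this repo's defaults):

```python
# Parameters stored once instead of twice when the embedding and the
# MLM output projection share one weight matrix.
vocab_size, hidden = 30522, 768              # assumed BERT-base-like sizes
shared_params = vocab_size * hidden
print(shared_params)                         # 23440896 parameters saved
print(shared_params * 4 / 2**20)             # ~89 MB at float32
```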

codertimo commented 5 years ago

@jiqiujia @briandw Cool, I'll implement it in the 0.0.1a5 version, but it seems like solving #32 is a higher priority.