allenai / bilm-tf

TensorFlow implementation of contextualized word representations from bi-directional language models
Apache License 2.0

Why can ELMo's word embeddings represent words better than GloVe? #139

Closed guotong1988 closed 5 years ago

guotong1988 commented 5 years ago

Based on my understanding, ELMo first initializes a word embedding matrix A for all the words, then adds an LSTM B on top, and finally uses LSTM B's outputs to predict each word's next word.

I am wondering why, after training, we can input each word in the vocabulary and get its final representation from the word embedding matrix A alone.

It seems that we lose the information from LSTM B.

Why does the embedding contain the information we want from the language model?

Why does the training process inject the information needed for a good word representation into the word embedding matrix A?

Thank you!

PhilipMay commented 5 years ago

From my understanding, GloVe is a static word-to-vector map. ELMo / bilm is a deep contextualized word representation: it calculates a vector for a word at runtime, using the context of the whole sentence, not just the word in isolation. Also, because it builds word representations from characters, it can create vectors for words it has never seen before. All of this is impossible for GloVe, which is just a lookup table at the end of the day.
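
For anyone landing here later, here is a minimal sketch of what "at runtime, with the context of the whole sentence" means in practice, adapted from this repo's usage_character.py pattern. The vocab/options/weights paths are placeholders for your own files, and the token indices assume the model output aligns with the input tokens:

```python
import tensorflow as tf
from bilm import Batcher, BidirectionalLanguageModel, weight_layers

# Placeholder paths -- substitute your own pretrained biLM files.
vocab_file = 'vocab.txt'
options_file = 'elmo_options.json'
weight_file = 'elmo_weights.hdf5'

# Map tokenized sentences to character ids (max 50 characters per token).
batcher = Batcher(vocab_file, 50)

# Input: (batch_size, num_tokens, 50) character ids.
character_ids = tf.placeholder('int32', shape=(None, None, 50))

# Build the biLM graph; the returned ops expose all internal layer activations.
bilm = BidirectionalLanguageModel(options_file, weight_file)
embeddings_op = bilm(character_ids)

# ELMo vector = learned weighted sum over the biLM layers.
elmo_op = weight_layers('elmo', embeddings_op, l2_coef=0.0)

sentences = [
    ['I', 'deposited', 'cash', 'at', 'the', 'bank', '.'],
    ['We', 'sat', 'on', 'the', 'bank', 'of', 'the', 'river', '.'],
]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    ids = batcher.batch_sentences(sentences)
    elmo_vecs = sess.run(elmo_op['weighted_op'],
                         feed_dict={character_ids: ids})

# "bank" sits at position 5 in the first sentence and position 4 in the second;
# its two ELMo vectors differ because the surrounding context differs.
print(elmo_vecs[0, 5])
print(elmo_vecs[1, 4])
```

A static lookup table like GloVe would return the identical vector for "bank" in both sentences.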

matt-peters commented 5 years ago

The contextualized representations use the LSTM layers so that information is not lost.
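
Concretely, the ELMo vector for each token is a weighted combination over all biLM layers (the context-insensitive input layer plus each LSTM layer), which is what `weight_layers` computes in this repo. A rough numpy sketch of that combination (names and shapes are illustrative, not the repo's API):

```python
import numpy as np

def elmo_combine(layer_activations, s, gamma):
    """Weighted combination of biLM layers for one token.

    layer_activations: shape (L + 1, dim) -- the token-level input layer
        plus each biLSTM layer's hidden state for this token.
    s: unnormalized layer-weight logits, shape (L + 1,).
    gamma: scalar that scales the whole ELMo vector.
    """
    weights = np.exp(s) / np.sum(np.exp(s))  # softmax over layers
    return gamma * np.sum(weights[:, None] * layer_activations, axis=0)

# Example: input layer + 2 LSTM layers, 1024-dim activations.
layers = np.random.randn(3, 1024)
print(elmo_combine(layers, s=np.zeros(3), gamma=1.0).shape)  # (1024,)
```

So the LSTM outputs are part of every ELMo vector; only the context-insensitive input layer by itself would discard them.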

pzhang84 commented 2 years ago

@PhilipMay You said, "It calculates a vector for a word at runtime, using the context of the whole sentence." Does that also explain why ELMo produces different embeddings for the same sentence when I run it several times (the embeddings are close, though)?