Closed harsh306 closed 5 years ago
@harsh306 Doing it this way you'll get subword embeddings, so to get a word embedding we also need some function over the set of subword embeddings that maps it to a single word embedding. I'd suggest that max pooling is probably the best choice here, since the model itself uses max pooling and it therefore becomes the "natural choice" — but I'd prefer to hear comments from the LASER developers.
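To make the pooling idea concrete, here is a minimal sketch of collapsing subword-level vectors into word-level vectors. It assumes you have already obtained a `(num_subwords, dim)` array of contextualized subword embeddings and know which subword positions belong to each word (`word_spans` is a hypothetical alignment, not something LASER provides directly):

```python
import numpy as np

def pool_subword_embeddings(subword_embs, word_spans, pooling="max"):
    """Collapse BPE-level embeddings into word-level embeddings.

    subword_embs: (num_subwords, dim) array of contextualized subword vectors.
    word_spans: list of (start, end) index pairs, one per word, giving the
        half-open range of subword positions that make up that word.
    pooling: "max" (elementwise max, as suggested above) or "mean".
    """
    word_vectors = []
    for start, end in word_spans:
        chunk = subword_embs[start:end]
        if pooling == "max":
            word_vectors.append(chunk.max(axis=0))
        else:
            word_vectors.append(chunk.mean(axis=0))
    return np.stack(word_vectors)

# Toy usage: 5 subwords of dimension 2, grouped into 2 words.
subs = np.arange(10, dtype=float).reshape(5, 2)
spans = [(0, 2), (2, 5)]  # word 1 = subwords 0-1, word 2 = subwords 2-4
words = pool_subword_embeddings(subs, spans)
print(words.shape)  # (2, 2)
```

Mean pooling is included as an alternative, but per the comment above, max pooling matches what the LASER encoder itself uses to build the sentence embedding.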
LASER was trained to produce sentence embeddings. There may be ways to get something like a word embedding out of it, but the system was not trained for that, so the results are probably suboptimal. The outputs of the last BiLSTM layer could be regarded as contextualized word embeddings. Note, however, that these are at the BPE level, not the word level, since BPE units are the input to the LASER encoder. If you are only interested in word embeddings, I would recommend using another approach that was developed and trained specifically for this. There are a couple of choices.
What is the best way to get word embeddings out of LASER instead of sentence embeddings?