flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Applying an LM to some text / retrieve concrete contextualized word embedding #645

Closed. khituras closed this issue 5 years ago.

khituras commented 5 years ago

Hi FLAIR team and users,

Given a document, I would like to compute the contextualized word embedding for arbitrary words of that document. How can I do this with FLAIR? I have successfully trained language models and trained sequence taggers with them, but I have not yet seen how to compute the embeddings individually.

Thanks!

Erik

alanakbik commented 5 years ago

Hi @khituras - if I understand your question correctly, you should first construct Sentence objects for your documents and then embed them with our embedding classes. Then you can iterate through the Token objects of each Sentence and retrieve the embedding:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# create sentence for document.
sentence = Sentence('This is my document .')

# init embedding
flair_embedding = FlairEmbeddings('news-forward')

# embed the document.
flair_embedding.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

You can modify this code snippet with the embeddings you want to use. We generally recommend stacking different embedding types to get better embeddings - see the sketch below.
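For example (a rough sketch using StackedEmbeddings; the model names 'glove', 'news-forward' and 'news-backward' are just common choices, not a prescription), stacking could look like this:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings

# combine several embedding types into a single stacked embedding
stacked_embedding = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])

# embed the sentence once with the whole stack
sentence = Sentence('This is my document .')
stacked_embedding.embed(sentence)

# each token now carries one concatenated vector
for token in sentence:
    print(token)
    print(token.embedding)

Hope this helps!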

khituras commented 5 years ago

Thank you very much @alanakbik, this was exactly what I was looking for. So I need to input the sentences, which absolutely makes sense for contextualized embeddings. Let's assume I am only interested in the embedding vector of a single word in the middle of a very long document. Should I always input all sentences of the (very long) document, or would it suffice to input only the sentences in the vicinity of the word? I would do this to save computational cost, of course. Do you have any experience with how much context is required to get a good embedding vector for a specific word?

Or should I generally stick to the kind of context that the language model was trained on in the first place (which would make sense)?

alanakbik commented 5 years ago

Hello @khituras, yes - if you only care about the embedding of one word in a long document, you can use a word window around it. My recommendation would be to use the full sentence in which the word appears (most LMs are trained at sentence level anyway). If that is still too long, you could use a smaller window, but this may impact embedding quality - I have never tried it, so I cannot say by how much.
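As a rough illustration of the sentence-level idea (a minimal sketch; it assumes you have already extracted the sentence containing your target word, and the example sentence and word are placeholders):

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

flair_embedding = FlairEmbeddings('news-forward')

# embed only the sentence (or small window) that contains the word of interest,
# rather than the whole document
local_context = Sentence('The treatment clearly reduced the symptoms .')
flair_embedding.embed(local_context)

# pick out the contextualized embedding of the target word
for token in local_context:
    if token.text == 'symptoms':
        print(token.embedding)

Hope this helps!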

khituras commented 5 years ago

Thank you so much for all your support, @alanakbik! This is all the information I need for now. I am curious about the results. I will close this issue now that my questions have been answered. Thanks again!

alanakbik commented 5 years ago

Great - we're also curious about your results so please share if you can! :)