flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Embeddings for multi-word expressions #100

Closed kyoungrok0517 closed 6 years ago

kyoungrok0517 commented 6 years ago

Hello, I'm curious how I can get embeddings for multi-word expressions. For instance, from "George Washington is a president." I want to get the embedding for "George Washington". Since the paper Flair is based on, contextual string embeddings, claims to treat text as a sequence of characters rather than words, I thought getting such embeddings would be straightforward (for instance, something like sentence.get_embedding("George Washington")), but it seems not.

stefan-it commented 6 years ago

@kyoungrok0517 Here's a nice paper from a former professor of mine that shows how to construct phrase embeddings: http://www.aclweb.org/anthology/P14-3006.

Here are some examples of nearest neighbors for some phrases:

[screenshot: nearest-neighbor examples for several phrases]

I'm not sure if that's possible with the current architecture of flair, but methods (like the one mentioned in the paper above) definitely exist for that kind of problem :)
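As a simple generic baseline (plain mean pooling, not necessarily the method from the paper above), you could also average the Flair token embeddings over the phrase. A minimal sketch, assuming the news-forward/news-backward models:

import torch

from flair.data import Sentence
from flair.embeddings import CharLMEmbeddings, StackedEmbeddings

# embed the sentence with forward and backward character LMs
embedding = StackedEmbeddings([
    CharLMEmbeddings('news-forward'),
    CharLMEmbeddings('news-backward'),
])

sentence = Sentence('George Washington is a president')
embedding.embed(sentence)

# mean-pool the token embeddings of "George Washington" (tokens 0 and 1)
phrase_embedding = torch.mean(
    torch.stack([token.get_embedding() for token in sentence.tokens[0:2]]),
    dim=0)
print(phrase_embedding.size())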

kyoungrok0517 commented 6 years ago

@stefan-it Thanks! But what I thought was that such phrase embeddings would be possible 'natively', without extra measures like the one you just introduced. Hope to get some answers from the authors :)

alanakbik commented 6 years ago

@kyoungrok0517 thanks for your interest! You are correct that you can get embeddings for MWEs directly from the character language model and they are likely to be meaningful. We haven't done any analysis ourselves but would be very interested in what you find!

Flair currently does not include a "convenience method" for extracting embeddings for arbitrary MWEs, but I think we can add this for the next release (0.3).

Until then, you can use the code snippet below to extract MWE embeddings. I haven't tested it yet, but I think it should work. The trick is to take the last character state of the forward LM (i.e. at the last word of the MWE) and the first character state of the backward LM (i.e. at the first word of the MWE) and concatenate both to get the embedding. Here's the code:

import torch

from flair.data import Sentence, Token
from flair.embeddings import CharLMEmbeddings, StackedEmbeddings

# stacked embedding consisting of forward and backward LM
forward_lm = CharLMEmbeddings('news-forward')
backward_lm = CharLMEmbeddings('news-backward')

embedding = StackedEmbeddings([
    forward_lm,
    backward_lm,
])

# your sentence
sentence = Sentence('George Washington is a president')

# embed sentence with forward and backward LM
embedding.embed(sentence)

# begin and end tokens of the MWE you want to embed
# (tokens 0 and 1 cover "George Washington")
mwe_begin_token = sentence[0]
mwe_end_token = sentence[1]

# helper method to get MWE embedding
def get_embedding_for_mwe(
        begin_token: Token,
        end_token: Token,
        forward_lm_name: str,
        backward_lm_name: str):

    print('getting embedding for MWE: [{} ... {}]'.format(begin_token.text, end_token.text))

    # get first character state of backward LM
    first_character_state = begin_token._embeddings[backward_lm_name]

    # get last character state of forward LM
    last_character_state = end_token._embeddings[forward_lm_name]

    # concatenate both for final embedding
    return torch.cat([first_character_state, last_character_state])

mwe_embedding = get_embedding_for_mwe(
    mwe_begin_token,
    mwe_end_token,
    forward_lm.name,
    backward_lm.name
)

print(mwe_embedding.size())

Does this work for you?

kyoungrok0517 commented 6 years ago

Great, thanks! It seems the code runs without errors. I will test it further performance-wise and report if I encounter any unexpected behavior.

felipenv commented 5 years ago

@alanakbik Was this implemented in some recent version, or do we still need to do it "manually" from the forward and backward LM separately, as above? It would be interesting to have some tokenization that already generates the multi-word tokens and gets their embeddings automatically.

Thank you.

alanakbik commented 5 years ago

@felipenv this is not yet implemented, i.e. you would still need to use the above code. However, as of Flair 0.4.4 you can use different tokenizers, so you could implement your own tokenizer that keeps the multi-words you want as single tokens and thus get their embeddings automatically.
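For illustration, here is a rough sketch of such a custom tokenizer. The names MWES and mwe_tokenizer are hypothetical, and the use_tokenizer callable interface (str -> List[Token]) follows Flair 0.4.4 and may differ in other versions:

from typing import List

from flair.data import Sentence, Token

# hypothetical: a fixed list of multi-word expressions to keep as single tokens
MWES = ['George Washington']

def mwe_tokenizer(text: str) -> List[Token]:
    # temporarily glue each MWE together with a non-breaking space
    # so it survives whitespace splitting
    for mwe in MWES:
        text = text.replace(mwe, mwe.replace(' ', '\u00A0'))
    # split on whitespace and restore regular spaces inside MWE tokens
    return [Token(piece.replace('\u00A0', ' ')) for piece in text.split()]

sentence = Sentence('George Washington is a president', use_tokenizer=mwe_tokenizer)
print([token.text for token in sentence.tokens])
# expected: ['George Washington', 'is', 'a', 'president']

Embedding such a sentence with a character LM should then produce one embedding per token, including one for the whole MWE, since the character LM does not care whether a token contains a space.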