flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Setting token embeddings on `Sentence` when using a classifier #2469

Closed DeNeutoy closed 2 years ago

DeNeutoy commented 3 years ago

Hello! Thank you for the excellent library, in particular the zero-shot models. I am constantly impressed by the detailed and useful work coming out of Zalando Research. I was wondering: is there a flag to retain the token embeddings of a sentence when using the Flair classifiers?

e.g.:

from flair.data import Sentence
from flair.models import TARSTagger

tars = TARSTagger.load("tars-ner")
sent = Sentence("My example sentence.")

tars.predict(sent)

# This is an empty tensor, but I would like it to be populated with the
# detached contextual word vectors that the model uses
print(sent[0].embedding)

I tried passing embedding_storage_mode="cpu" as a flag to predict, but this did not seem to change the result. Is this possible?

Additionally, I would be happy to contribute a proper docs page that can be deployed on GitHub, as I have done before for AllenNLP. A first version should be straightforward, since it can be produced from markdown files and your current tutorials are already in markdown format.

Thanks!

alanakbik commented 3 years ago

Hello @DeNeutoy, yes, but this currently only works for non-TARS classes like TextClassifier or SequenceTagger. The trick is, as you wrote, to set embedding_storage_mode to "cpu" or "gpu".

from flair.data import Sentence
from flair.models import TextClassifier, SequenceTagger

# -----------------------------------------------------------
# Example 1: Standard Flair Classifier
# -----------------------------------------------------------
classifier: TextClassifier = TextClassifier.load("sentiment-fast")

# predict on example sentence with embedding storage mode set to "cpu"
sentence = Sentence("Very positive sentiment.")
classifier.predict(sentence, embedding_storage_mode="cpu")

# print embedding (works, prints embedding)
print(sentence.embedding)

# -----------------------------------------------------------
# Example 2: Standard Flair Sequence Tagger
# -----------------------------------------------------------
tagger: SequenceTagger = SequenceTagger.load("ner-fast")

# predict on example sentence with embedding storage mode set to "cpu"
sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence, embedding_storage_mode="cpu")

# print embedding of first token in sentence (works, prints embedding)
print(sentence[0].embedding)

However, for TARS-like classes this currently doesn't work. The reason is that TARS models produce multiple embeddings for each sentence, i.e. one embedding for each combination of a sentence and the label to be predicted. We'd need to adapt the logic such that these embeddings get written back into the original sentence. Is this something you need?
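For intuition, here is a rough sketch of that blow-up. The exact separator token is internal to Flair; "[SEP]" below is only an assumption for illustration:

labels = ["river", "city", "product"]
text = "Berlin lies at the Spree"

# TARS embeds one reformatted input per (sentence, label) pair, so each
# token ends up with one embedding per candidate label instead of a
# single vector that could be written back to the original Sentence
tars_style_inputs = [f"{label} [SEP] {text}" for label in labels]  # separator is assumed
for tars_input in tars_style_inputs:
    print(tars_input)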

Regarding the other issue: proper docs would be great and we'd appreciate any help! Would it be docs like this?

DeNeutoy commented 3 years ago

Ah, I see, that's kind of obvious in retrospect.

What I am trying to do is create entity representations from the start/end concatenation of contextual word vectors for zero/few-shot predicted entities, with the idea of clustering them afterward. To do this, I think I would want the word representations from the forward pass of the class that a given span was predicted as, but storing all of them (one per class per word) does seem a little memory-heavy. Perhaps there could also be a way to aggregate over these, e.g. taking the average of the word vectors? Perhaps this doesn't make sense.
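Roughly what I have in mind, as a minimal sketch (using a standard tagger here since TARS doesn't store embeddings; the model name and sentence are just illustrative):

import torch

from flair.data import Sentence
from flair.models import SequenceTagger

# predict entities and keep the contextual token embeddings on CPU
tagger = SequenceTagger.load("ner-fast")
sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence, embedding_storage_mode="cpu")

# one fixed-size vector per entity: first and last token embeddings concatenated
entity_vectors = [
    torch.cat([span.tokens[0].embedding, span.tokens[-1].embedding], dim=0)
    for span in sentence.get_spans("ner")
]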

Docs are in #2473! LMK if you have any questions.

alanakbik commented 3 years ago

Intuitively, I'd say taking the vector from the forward pass of the class it was predicted as might be best.

That's not directly supported currently, but you could build a brute force solution where you first predict all entities, then construct TARS-like sentences for each predicted entity in its context to get the appropriate embeddings in a second step:

from flair.data import Sentence
from flair.models import TARSTagger

# an example sentence
sentence = Sentence("Berlin lies at the Spree")

# use TARS to predict some classes
tagger: TARSTagger = TARSTagger.load('tars-ner')
tagger.add_and_switch_to_new_task(task_name="zero-shot-example",
                                  label_type="ner",
                                  label_dictionary=["river", "city", "product"])
tagger.predict(sentence)

# we predicted "Berlin" as city and "Spree" as river
print(sentence)

# go through predicted entities, create for each a TARS-sentence, embed, and get embedding
for entity in sentence.get_spans("ner"):

    print(f"\nLets get the embedding for: {entity}")

    # make a TARS-formatted sentence
    tars_formatted_sentence = tagger._get_tars_formatted_sentence(entity.tag, sentence)
    print(f" - TARS-formatted sentence: '{tars_formatted_sentence.to_original_text()}'")

    # use the tars embeddings to embed this sentence
    tagger.tars_embeddings.embed(tars_formatted_sentence)

    # retrieve the entity again in the TARS-formatted sentence
    for tars_entity in tars_formatted_sentence.get_spans("tars_label"):
        print(f" - Entity in TARS sentence: {tars_entity}")
        print(f" - Embedding (of first token): {tars_entity[0].embedding}")

And thanks for the docs! Some links are broken since the tutorials link to each other, but we'll use this as a starting point to create nicer docs.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.