flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

how to get a sentence representation using bert #529

Closed · omerarshad closed this issue 5 years ago

omerarshad commented 5 years ago

How can we extract a final sentence embedding using BERT?

alanakbik commented 5 years ago

Right now, the only way is to use one of our DocumentEmbeddings classes for this and pass BertEmbeddings to them. At some point, we will also add a dedicated BERT class for document embeddings.
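For illustration, a minimal sketch of that approach (not from the original comment), assuming the pre-trained 'bert-base-uncased' model; DocumentPoolEmbeddings pools the BERT token vectors into one fixed-size sentence vector:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings, DocumentPoolEmbeddings

# wrap BERT token embeddings in a document-level pooling embedding
bert_embedding = BertEmbeddings('bert-base-uncased')
document_embeddings = DocumentPoolEmbeddings([bert_embedding])

# embed an example sentence
sentence = Sentence('The grass is green .')
document_embeddings.embed(sentence)

# a single fixed-size vector for the whole sentence
print(sentence.get_embedding())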

alanakbik commented 5 years ago

Closing for now, but feel free to reopen if you have more questions!

pannous commented 5 years ago

Oh that's sad, the DocumentPoolEmbeddings' mean/min/max operations remove most of the contextual goodness of BERT.

alanakbik commented 5 years ago

You could use DocumentRNNEmbeddings if you want to train them for a task. But I agree, we should add a dedicated DocumentBertEmbeddings class for version 0.5.
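As an illustrative sketch (not from the thread), passing BertEmbeddings into DocumentRNNEmbeddings would look roughly like this; hidden_size=256 is an arbitrary choice and the RNN still needs to be trained on a downstream task:

from flair.embeddings import BertEmbeddings, DocumentRNNEmbeddings

# BERT token embeddings fed into a trainable document-level RNN
bert_embedding = BertEmbeddings('bert-base-uncased')
document_embeddings = DocumentRNNEmbeddings([bert_embedding], hidden_size=256)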

neildhir commented 5 years ago

To clarify my understanding of this issue: the idea is that we want a single unified embedding for an entire sentence, rather than the per-word embeddings we currently get with e.g.

from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# init the ELMo embedding
embedding = ELMoEmbeddings('original')
# create a sentence
sentence = Sentence('The grass is green .')
# embed each word in the sentence
embedding.embed(sentence)

which generates:

[Sentence: "The grass is green ." - 5 Tokens]

where each token has an embedding of size 1x1024. Omer is asking whether we can instead embed the entire sentence into a single vector of size e.g. 1x1024 using ELMo?

alanakbik commented 5 years ago

You can use a combination of DocumentEmbeddings and ELMoEmbeddings to get a single 1024-dimensional sentence embedding with ELMo, like this:

from flair.data import Sentence
from flair.embeddings import DocumentRNNEmbeddings, ELMoEmbeddings

# ELMo word embeddings
elmo_embedding = ELMoEmbeddings()

# Document embeddings
document_embeddings = DocumentRNNEmbeddings([elmo_embedding], hidden_size=1024)

# create an example sentence
sentence = Sentence('I love Berlin')

# embed the sentence with our document embedding
document_embeddings.embed(sentence)

# now check out the embedded sentence.
print(sentence.get_embedding())

Note that this only works if you can train the DocumentRNNEmbeddings in a downstream task (they are randomly initialized by default, i.e. nonsensical if untrained). If you cannot train them, you should use the DocumentPoolEmbeddings instead.
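For reference, a minimal sketch of that untrained alternative, pooling ELMo token vectors into one sentence vector with no training required:

from flair.data import Sentence
from flair.embeddings import DocumentPoolEmbeddings, ELMoEmbeddings

# pool ELMo token embeddings into a single sentence vector; no training needed
elmo_embedding = ELMoEmbeddings()
document_embeddings = DocumentPoolEmbeddings([elmo_embedding])

sentence = Sentence('I love Berlin')
document_embeddings.embed(sentence)
print(sentence.get_embedding())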

neildhir commented 5 years ago

Hey Alan, thanks for that answer, very informative.

On the topic of training then: suppose I have a binary classification task (i.e. each sentence gets one of two labels) and about 1000 training examples. Will dynamic embeddings (i.e. those that require training) actually work given that the amount of data I have for my particular task is so small?

alanakbik commented 5 years ago

With so little data I'm not sure you can train a good RNN, but if the task is very easy it might work. I would probably try DocumentPoolEmbeddings first. But with so little data, the easiest thing would be to compare both methods and see which one works best :)
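As a rough sketch of what "training them for a task" means here, following the flair 0.4.x API used elsewhere in this thread: the document embeddings are trained jointly with a TextClassifier. Note that `corpus` is assumed to be a flair Corpus holding the ~1000 labeled sentences (corpus loading is left out), and the hidden size and path are illustrative:

from flair.embeddings import DocumentRNNEmbeddings, ELMoEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# trainable document embeddings over ELMo word embeddings
document_embeddings = DocumentRNNEmbeddings([ELMoEmbeddings()], hidden_size=256)

# 'corpus' is assumed to be a flair Corpus with the ~1000 labeled sentences
classifier = TextClassifier(document_embeddings,
                            label_dictionary=corpus.make_label_dictionary(),
                            multi_label=False)

# the document embeddings are updated as part of classifier training
trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/classifiers/binary-example', max_epochs=10)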

neildhir commented 5 years ago

That's what I thought; thanks for the advice though, I'll give both a shot!

neildhir commented 5 years ago

Presumably if I use DocumentRNNEmbeddings with minimal settings on these parameters (and others):

:param hidden_size: the number of hidden states in the rnn.
:param rnn_layers: the number of layers for the rnn.

then there will be fewer parameters for the model to learn, so presumably a smaller training set may be sufficient?
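For concreteness, a small-capacity configuration along those lines might look like the sketch below; the values are illustrative, not recommendations from the thread:

from flair.embeddings import DocumentRNNEmbeddings, ELMoEmbeddings

# a deliberately small RNN head: few hidden units and a single layer
document_embeddings = DocumentRNNEmbeddings(
    [ELMoEmbeddings()],
    hidden_size=64,
    rnn_layers=1,
)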

kshitij12345 commented 5 years ago

@wagglefoot Did you try both DocumentPoolEmbeddings and DocumentRNNEmbeddings? If yes, could you let me know which one worked better?

alanakbik commented 5 years ago

@wagglefoot @kshitij12345 as of Flair 0.4.2, the DocumentPoolEmbeddings have become more powerful: they now allow you to train word embedding maps before pooling. This simple 'FastText'-style approach can yield very strong baselines. The default operation is a 'linear' transformation, but if you only use simple word embeddings that are not task-trained, you should probably use a 'nonlinear' transformation instead:

from flair.embeddings import DocumentPoolEmbeddings, WordEmbeddings

# instantiate pre-trained word embeddings
embeddings = WordEmbeddings('glove')

# document pool embeddings with a trainable nonlinear map before pooling
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='nonlinear')

If on the other hand you use word embeddings that are task-trained (such as simple one-hot encoded embeddings), you are often better off doing no transformation at all. Do this by passing 'none':

from flair.embeddings import DocumentPoolEmbeddings, OneHotEmbeddings

# instantiate one-hot encoded word embeddings (trained on the task corpus)
embeddings = OneHotEmbeddings(corpus)

# document pool embeddings without an additional transformation
document_embeddings = DocumentPoolEmbeddings([embeddings], fine_tune_mode='none')

You could also combine several embeddings, but if one of them is task-trained (such as OneHotEmbeddings), you should set fine_tune_mode='none'.
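For example, combining pre-trained GloVe embeddings with task-trained one-hot embeddings would follow that rule like this (a sketch; `corpus` is the same task Corpus as above):

from flair.embeddings import DocumentPoolEmbeddings, OneHotEmbeddings, WordEmbeddings

# one of the stacked embeddings is task-trained, so no extra fine-tuning map
document_embeddings = DocumentPoolEmbeddings(
    [WordEmbeddings('glove'), OneHotEmbeddings(corpus)],
    fine_tune_mode='none',
)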

nico-unity commented 5 years ago

Taking the mean of each word vector should give a good representation of your sentence.
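For completeness, a minimal sketch of doing that mean pooling by hand over BERT token vectors (assuming the 'bert-base-uncased' model), rather than going through DocumentPoolEmbeddings:

import torch

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# embed each token with BERT
bert_embedding = BertEmbeddings('bert-base-uncased')
sentence = Sentence('The grass is green .')
bert_embedding.embed(sentence)

# average the per-token vectors into a single sentence vector
sentence_vector = torch.mean(torch.stack([token.embedding for token in sentence]), dim=0)
print(sentence_vector.shape)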