UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

SentenceTransformer API vs. Transformer API + pooling #405

Open githubrandomuser2017 opened 4 years ago

githubrandomuser2017 commented 4 years ago

In your documentation you mention two approaches to using your package to create sentence embeddings.

First, from the Quickstart, you wrote:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)
print(sentence_embeddings.shape)
# (3, 768)

Second, from Sentence Embeddings with Transformers, you wrote:

import torch
from transformers import AutoTokenizer, AutoModel

# Mean pooling: average the token embeddings, ignoring padding via the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
# Model is of type: transformers.modeling_bert.BertModel

# Tokenize the same sentences as above, then compute token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings.shape)
# torch.Size([3, 768])

What are the important differences between these two approaches? The only difference I can see is that in the second approach, BertModel returns token embeddings and you then manually perform pooling (mean or max). If I use this second approach, what would I be missing compared to SentenceTransformer.encode()?

nreimers commented 4 years ago

SentenceTransformer.encode() performs several optimizations to encode the input as quickly as possible. Tokenization and embedding computation can run in parallel (if a GPU is available); further, it sorts and batches the input so that only minimal padding is needed, which also improves performance.

In general, SentenceTransformer.encode() is, in my opinion, much more convenient than the AutoModel approach if you want to get embeddings for input texts.
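
As a minimal sketch of what that looks like in practice (batch_size, convert_to_tensor, and show_progress_bar are real encode() parameters; the values below are just illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
sentences = ['This framework generates embeddings for each input sentence',
             'The quick brown fox jumps over the lazy dog.']

# encode() tokenizes, sorts by length to minimize padding, batches, and embeds in one call
embeddings = model.encode(sentences, batch_size=32, convert_to_tensor=True, show_progress_bar=False)
print(embeddings.shape)
# torch.Size([2, 768])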

githubrandomuser2017 commented 4 years ago

In your section Sentence Embeddings with Transformers, you wrote:

Most of our pre-trained models are based on Huggingface.co/Transformers and are also hosted in the models repository from Hugging Face.

In the Hugging Face models repository, I see a lot of different models, including those from your sentence-transformers package.

Is it possible to use any of these models, or only the ones under sentence-transformers? Am I correct in assuming that your models are specifically configured to return token embeddings, which can then be run through a pooling function?

nreimers commented 4 years ago

In theory you could use any of them; however, out of the box they do not produce good sentence embeddings.

The sentence-transformers models were specifically trained to produce meaningful sentence embeddings.

The other models also return token embeddings. However, when you average them, the representation does not necessarily make sense.
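
As a rough sketch, any Hugging Face checkpoint can be wrapped with a pooling layer via the models module; the result runs, but as said above it will not give good sentence embeddings without fine-tuning (bert-base-uncased here is just an arbitrary example):

from sentence_transformers import SentenceTransformer, models

# Wrap an arbitrary Hugging Face checkpoint with a mean-pooling layer
word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

embeddings = model.encode(['An example sentence', 'Another one'])
print(embeddings.shape)
# (2, 768)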

MathewAlexander commented 4 years ago

@nreimers When I fine-tuned XLNet-large on STS-B, I got a Pearson correlation coefficient of +0.917 on the development set. Also, in the leaderboards linked here and here, I see many models above 90. Doesn't that mean they are better than the sentence-transformers models?

nreimers commented 4 years ago

Hi @MathewAlexander

It depends on your use case: if you just want to compute the similarity of two sentences, then using BERT & Co. as a cross-encoder works better, and you don't need this package. You pass both sentences to BERT together and get a score that indicates their similarity.

However, this scales badly. Assume you have 10k sentences and you want to find the most similar pair: 10k sentences lead to about 50 million different combinations, so you would need to apply the BERT cross-encoder to 50 million sentence pairs, which takes quite a long time.

With SentenceTransformer, you compute an embedding for each of the 10k sentences and then compare them with cosine similarity. This takes only seconds.

The performance will be worse, but you get the result within seconds and don't have to wait hours or even days for it.
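
To make the scaling argument concrete, here is a rough sketch of the bi-encoder route with a tiny made-up corpus; util.pytorch_cos_sim is part of this package, the rest is plain PyTorch:

import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

corpus = ['A man is eating food.', 'A man is riding a horse.', 'Someone is playing a guitar.']
embeddings = model.encode(corpus, convert_to_tensor=True)  # one forward pass per sentence

# Cosine similarity between all pairs; the model ran len(corpus) times, not once per pair
cos_scores = util.pytorch_cos_sim(embeddings, embeddings)
cos_scores.fill_diagonal_(-1)  # ignore self-similarity
i, j = divmod(torch.argmax(cos_scores).item(), len(corpus))
print(f"Most similar pair: '{corpus[i]}' <-> '{corpus[j]}' (score {cos_scores[i][j].item():.3f})")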

MathewAlexander commented 4 years ago

Hi @nreimers, that makes sense. Thanks for the explanation.

githubrandomuser2017 commented 4 years ago

@nreimers

With SentenceTransformer, you compute an embedding for each of the 10k sentences and then compare them with cosine similarity.

If you compute an embedding for each sentence individually, how do you update the BERT weights during training backpropagation? Your paper does say that you update BERT (in Section 3):

In order to fine-tune BERT / RoBERTa, we create siamese and triplet networks (Schroff et al., 2015) to update the weights

nreimers commented 4 years ago

As mentioned in the paper, by using siamese or triplet networks, depending on the loss.

You pass a sentence pair (or triplet) and a label for training, measure the error, and backpropagate.
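
As a minimal sketch with the fit() API and CosineSimilarityLoss (the two labeled pairs below are made up; other losses work the same way):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Each InputExample holds a sentence pair and a gold similarity label in [0, 1]
train_examples = [
    InputExample(texts=['A plane is taking off.', 'An air plane is taking off.'], label=0.95),
    InputExample(texts=['A man is playing a flute.', 'A man is eating pasta.'], label=0.10),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Siamese setup: both sentences go through the same model, the loss compares their
# cosine similarity to the label, and backprop updates the shared weights
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)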

githubrandomuser2017 commented 4 years ago

@nreimers Why don't you use GPT2 as the basis of a Sentence Transformer model?

nreimers commented 4 years ago

@githubrandomuser2017 When SBERT was created, GPT2 was not available.

I never tested GPT2, but I think masked language modeling, as used in BERT, is a better pre-training task for getting sentence embeddings than the causal language modeling used by GPT2.

But it will be easy to fine-tune and test GPT2 with sentence-transformers.
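
A rough sketch of how GPT2 could be plugged in via the models module; GPT2 ships without a padding token, so reusing the EOS token as padding below is my own assumption rather than an official recipe:

from sentence_transformers import SentenceTransformer, models

# GPT2 as the word-embedding backbone; it still needs fine-tuning to give useful sentence embeddings
word_embedding_model = models.Transformer('gpt2', max_seq_length=128)

# GPT2 has no pad token, so reuse EOS for padding (assumption, needed for batched encoding)
word_embedding_model.tokenizer.pad_token = word_embedding_model.tokenizer.eos_token

pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

print(model.encode(['A test sentence']).shape)
# (1, 768) for gpt2-small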