githubrandomuser2017 opened 4 years ago
SentenceTransformer.encode() performs various optimization steps to ensure that the input is encoded as fast as possible. Tokenization and the embedding computation can run in parallel (if a GPU is available), and the input is batched so that only minimal padding is needed, which also improves performance.
In general, SentenceTransformer.encode() is, in my opinion, far more convenient to use than the AutoModel approach if you want to get embeddings for input texts.
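For illustration, a minimal sketch of that path (the model name and sentences are just placeholders; any sentence-transformers checkpoint works the same way):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")

sentences = [
    "This framework generates embeddings for each input sentence.",
    "Sentences are passed as a list of strings.",
]

# encode() handles tokenization, batching and GPU placement internally
# and returns one fixed-size vector per input sentence.
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=False)
print(embeddings.shape)  # (2, 768) for BERT-base based models
```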
In your section Sentence Embeddings with Transformers, you wrote:
Most of our pre-trained models are based on Huggingface.co/Transformers and are also hosted in the models repository of Hugging Face.
In the HuggingFace models repository, I see a lot of different models, including those from your sentence-transformers package.
Is it possible to use any of these models, or just those from sentence-transformers? Am I correct in assuming that your models are specifically configured to return the token embeddings? That model output can then be run through a pooling function.
In theory, you could use any of them; however, out of the box they do not produce good sentence embeddings.
The sentence-transformers models were specifically trained to produce meaningful sentence embeddings.
The other models also return token embeddings, but when you average them, the representation does not necessarily make sense.
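To illustrate the point, here is a rough sketch of averaging the token embeddings of a plain Hugging Face model (model name and sentences are placeholders). The pooling itself is straightforward; without sentence-level fine-tuning, the resulting vectors are just not guaranteed to be meaningful:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The cat sits on the mat.", "A feline rests on a rug."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, hidden)

# Mask-aware mean pooling: ignore padding tokens when averaging.
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # (2, 768)
```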
Hi @MathewAlexander
It depends on your use case: if you just want to compute the similarity for two sentences, then using BERT & Co. as a cross-encoder works better, and you don't need this package for that. You pass both sentences to BERT together and get a score that indicates their similarity.
However, this scales badly. Assume you have 10k sentences and want to find the most similar pair: 10k sentences lead to about 50 million different combinations, so you would have to apply the BERT cross-encoder to 50 million sentence pairs, which takes quite a long time.
With sentence-transformers, you compute an embedding for each of the 10k sentences and then compare them with cosine similarity. This takes only seconds.
The performance (accuracy) will be a bit worse, but you get the result within seconds and don't have to wait hours or even days.
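A minimal sketch of this bi-encoder approach (model name and the small sentence list are placeholders; the same code works unchanged for 10k sentences):

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A cheetah is running behind its prey.",
]

# Embed every sentence once, then compare all pairs with cosine similarity.
embeddings = model.encode(sentences, convert_to_tensor=True)   # (n, dim)
normalized = torch.nn.functional.normalize(embeddings, dim=1)
similarity = normalized @ normalized.T                          # (n, n) cosine scores

# Mask the diagonal (self-similarity) and read off the best-scoring pair.
similarity.fill_diagonal_(-1.0)
best = torch.argmax(similarity)
i, j = divmod(best.item(), similarity.size(1))
print(sentences[i], "<->", sentences[j], similarity[i, j].item())
```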
Hi @nreimers, that makes sense. Thanks for the explanation.
@nreimers
With sentence-transformers, you compute an embedding for each of the 10k sentences and then compare them with cosine similarity.
If you compute an embedding for each sentence individually, how do you update the BERT weights during training (backprop)? Your paper does say that you update BERT (in Section 3):
In order to fine-tune BERT / RoBERTa, we create siamese and triplet networks (Schroff et al., 2015) to update the weights
As mentioned in the paper, by using siamese or triplet networks, depending on the loss.
You pass a sentence pair (or triplet) and a label for training, measure the error, and backpropagate it through the (shared) BERT weights.
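As a rough sketch of that pair-based training with this package (model name, example pairs, and labels are placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bert-base-nli-mean-tokens")

# Each training example is a sentence pair plus a gold similarity label.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man eats something."], label=0.9),
    InputExample(texts=["A man is eating food.", "The girl carries a baby."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# CosineSimilarityLoss runs both sentences through the same (shared-weight)
# encoder, compares the cosine similarity of the two embeddings to the label,
# and backpropagates the error through BERT.
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```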
@nreimers Why don't you use GPT2 as the basis of a Sentence Transformer model?
@githubrandomuser2017 When SBERT was created, GPT2 was not available.
I never tested GPT2, but I think Masked Language Modeling, as used in BERT, is a better pre-training task for getting sentence embeddings than the causal language modeling used by GPT2.
But it would be easy to fine-tune and test GPT2 with sentence-transformers.
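An untested sketch of what that could look like (mean pooling on top of GPT2; the pad-token workaround is an assumption, since GPT2 ships without a padding token):

```python
from sentence_transformers import SentenceTransformer, models

# Wrap GPT2 as the word-embedding module and add a mean-pooling layer on top.
word_embedding_model = models.Transformer("gpt2", max_seq_length=128)

# GPT2 has no padding token by default, so reuse the end-of-text token.
word_embedding_model.tokenizer.pad_token = word_embedding_model.tokenizer.eos_token

pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
embeddings = model.encode(["GPT2 as a sentence encoder is just an experiment."])
print(embeddings.shape)  # (1, 768) for the small GPT2
```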
In your documentation you mention two approaches to using your package to create sentence embeddings.
First, from the Quickstart, you wrote:
Second, from Sentence Embeddings with Transformers, you wrote:
What are the important differences between these two approaches? The only thing I can see is that in the second approach, the BertModel returns token embeddings and you manually perform pooling (mean or max). If I use this second approach, what would I be missing compared to using SentenceTransformer.encode()?