dmmiller612 / bert-extractive-summarizer

Easy to use extractive text summarization with BERT
MIT License

Suggestion to move over to sentence-bert for getting sentence embeddings #34

Closed aced125 closed 2 years ago

aced125 commented 4 years ago

Hey Authors,

Since you are tokenizing each sentence separately, I suggest checking out this paper (Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks) and the corresponding repo (https://github.com/UKPLab/sentence-transformers) from the UKP Lab in Germany.

They have shown that using the mean of the BERT embeddings of each word to represent a sentence does very poorly on benchmarks (though still better than using the CLS token).
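
To make the two baselines concrete, here is a minimal sketch of mean pooling versus CLS pooling over per-token vectors. The toy 2-dimensional vectors and the helper names `mean_pool` and `cls_pool` are made up for illustration; real BERT token embeddings would come from the model's hidden states.

```python
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Average the per-token vectors of one sentence into a single vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cls_pool(token_vectors: List[Vector]) -> Vector:
    """Use only the first ([CLS]) token's vector as the sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    return list(token_vectors[0])


# Toy "token embeddings" for a 3-token sequence: [CLS], word 1, word 2.
# These numbers are invented; they are not real BERT outputs.
tokens = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0]]

mean_vec = mean_pool(tokens)  # [1.0, 2.0]
cls_vec = cls_pool(tokens)    # [1.0, 0.0]
```

Sentence-BERT also ends in a pooling step, but the underlying network is fine-tuned so that the pooled vectors are directly comparable with cosine similarity.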

I know you guys are using the second-to-last or third-to-last layer, but moving over to sentence-transformers would be a trivial transition.

In short, using the mean of BERT embeddings yields a Spearman correlation of 0.45 on the STS benchmark, whereas Sentence-BERT reaches 0.84, a significant improvement.
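
For context on what that score measures: STS-style evaluation compares the model's similarity scores for sentence pairs against human ratings using Spearman's rank correlation. Below is a rough, dependency-free sketch; the sample scores are invented for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Invented example: human similarity ratings vs. model cosine similarities.
human_scores = [5.0, 3.2, 0.5, 4.1]
model_scores = [0.91, 0.60, 0.75, 0.80]
rho = spearman(human_scores, model_scores)  # approximately 0.8
```

Because only the ranking matters, a model scores well as long as it orders the sentence pairs the way human annotators do, regardless of the absolute similarity values.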

The model is easy enough to use:

```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
```

dmmiller612 commented 4 years ago

Great suggestion! I'll look into this further and see how we can integrate it. Right now we default to using the mean of the word embeddings at the second-to-last or third-to-last layer. Still, the mean approach can bias clustering results for really long or really short sentences (which is why we have a parameter for that). I'll run some checks I have used before against the new library to see whether some of those same issues exist.
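
As a rough illustration of guarding against that length bias, candidate sentences can be filtered by character count before they are embedded and clustered. The helper name `filter_by_length` and the thresholds below are assumptions made for this sketch, not necessarily the library's actual parameter names or defaults.

```python
from typing import List


def filter_by_length(
    sentences: List[str],
    min_chars: int = 40,   # assumed threshold, for illustration only
    max_chars: int = 600,  # assumed threshold, for illustration only
) -> List[str]:
    """Keep only sentences whose character count is within the given bounds."""
    return [s for s in sentences if min_chars <= len(s) <= max_chars]


candidates = [
    "Too short.",
    "This sentence is long enough to carry real content for a summary.",
    "x" * 1000,  # stands in for an overly long sentence
]
kept = filter_by_length(candidates)
```

Here only the middle sentence survives, so very short fragments and very long run-ons never reach the clustering step.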

cabhijith commented 4 years ago

Hey @dmmiller612! Any updates? I can attest that SBERT delivers better results in real systems. It would be great to use that!

cpatrickalves commented 2 years ago

Any updates on this? It seems like a really good improvement.

dmmiller612 commented 2 years ago

Implemented in version 0.9.0
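
For anyone landing here later, a minimal usage sketch follows. It assumes the `SBertSummarizer` class in `summarizer.sbert` and the `paraphrase-MiniLM-L6-v2` model name as shown in the project README; the wrapper function `summarize_with_sbert` is hypothetical, and the README for your installed version is the authority on the exact API.

```python
def summarize_with_sbert(
    text: str,
    num_sentences: int = 3,
    model_name: str = "paraphrase-MiniLM-L6-v2",  # assumed model name
) -> str:
    """Summarize `text` with the SBERT-backed summarizer.

    Requires bert-extractive-summarizer (>= 0.9.0) and sentence-transformers
    to be installed; the model is downloaded on first use.
    """
    # Imported lazily so this function can be defined without the dependency.
    from summarizer.sbert import SBertSummarizer

    model = SBertSummarizer(model_name)
    return model(text, num_sentences=num_sentences)
```

Calling `summarize_with_sbert(body)` on a long text should return the selected sentences joined into one string.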