UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Sentence transformers for long texts #1166

Open chaalic opened 3 years ago

chaalic commented 3 years ago

I am currently working on semantic similarity for comparing business descriptions. To this end, I'm using sentence transformers to vectorize the texts and cosine similarity as a comparison metric. However, the texts can be pretty long and are automatically truncated at the 512th token, so a lot of information is lost. My question is the following: would splitting the texts, vectorizing the parts, and averaging the sentence embeddings for every text be a good idea? Are there any resources/articles that discuss this matter?
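For reference, the chunk-and-average idea described above can be sketched like this (assuming the per-chunk embeddings have already been computed with `model.encode(chunks)`; random vectors stand in for them here, and the 384-dim size is just an illustrative choice):

```python
import numpy as np

# Stand-ins for the per-chunk embeddings that model.encode(chunks)
# would return; 384 dims matches many sentence-transformers models.
rng = np.random.default_rng(0)
chunk_embeddings = rng.normal(size=(5, 384))     # 5 chunks of one long text

doc_embedding = chunk_embeddings.mean(axis=0)    # average the chunk vectors
doc_embedding /= np.linalg.norm(doc_embedding)   # re-normalize for cosine similarity
```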

Thank you!

nreimers commented 3 years ago

I think it is not a good idea.

Better would be to do a many-to-many comparison: split each of the two descriptions into paragraphs and compute an embedding per paragraph. Then compute the pairwise similarity scores between all embeddings and take the maximum similarity, as in BERTScore: https://arxiv.org/abs/1904.09675
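A minimal sketch of that many-to-many comparison (function names are mine; the F1 combination of max-similarities follows the BERTScore paper, and the paragraph embeddings are assumed to be precomputed):

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarities between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def bertscore_like(a_embs, b_embs):
    """BERTScore-style score: match each paragraph to its best counterpart."""
    sims = cosine_matrix(a_embs, b_embs)
    recall = sims.max(axis=0).mean()     # best match for each paragraph of B
    precision = sims.max(axis=1).mean()  # best match for each paragraph of A
    return 2 * precision * recall / (precision + recall)  # F1, as in BERTScore
```

For two identical documents the score is 1.0; unrelated paragraph sets score lower.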

chaalic commented 3 years ago

Thank you for your answer.

I took a look at the article; indeed it would make more sense to do it this way, especially since the business descriptions are in fact collections of texts from different sources. However, since I have over 2.5 million business descriptions, I worry the process of computing pairwise similarity scores would eat up too much time.

nreimers commented 3 years ago

Computing the similarities is way faster than computing the embeddings. However, you have to implement it such that you don't run out of memory.

2.5 million descriptions is not that much; it should be doable.
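One way to keep memory bounded is to compute the similarity matrix in blocks of rows and retain only the best match per item, instead of materializing the full N×N matrix; a sketch (function name and block size are my own choices):

```python
import numpy as np

def top1_matches(embs, block=1024):
    """For each embedding, find its most similar other embedding, block by block."""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    n = len(embs)
    best_idx = np.empty(n, dtype=np.int64)
    best_sim = np.full(n, -np.inf)
    for start in range(0, n, block):
        sims = embs[start:start + block] @ embs.T              # one block of rows
        np.fill_diagonal(sims[:, start:start + block], -np.inf)  # mask self-matches
        best_idx[start:start + block] = sims.argmax(axis=1)
        best_sim[start:start + block] = sims.max(axis=1)
    return best_idx, best_sim
```

sentence-transformers also ships `util.paraphrase_mining`, which performs this kind of chunked top-k similarity search out of the box.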

K-Mike commented 2 years ago

@nreimers where can I set the max sequence length?

nreimers commented 2 years ago

@K-Mike https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length

RaphaelSilv commented 1 year ago

@chaalic did you have any progress using BERTScore?

I am currently trying to measure the similarity between essays written by students and a given theme, and somehow correlate it with the score the student got. For that, I broke the essays into sentences shorter than the max token limit to avoid truncation, and for each sentence calculated the score against the corresponding theme (usually a one-line sentence).

After the calculations, I got the average of the scores and plotted the results into a scatter plot chart.
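The pipeline described above, down to the averaging step, can be sketched as follows (embeddings assumed precomputed with the model; the function name is mine):

```python
import numpy as np

def essay_theme_score(sentence_embs, theme_emb):
    """Mean cosine similarity between an essay's sentence embeddings and the theme."""
    s = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    t = theme_emb / np.linalg.norm(theme_emb)
    return float((s @ t).mean())
```

Per the earlier suggestion in this thread, replacing the mean with the max over sentences may be worth trying: a single strongly on-topic sentence can otherwise be drowned out by many neutral ones.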

I was positive that it would work, maybe not perfectly, but at least show some good results. But as the attached scatter plots showed, I am far from that. Does anyone have any input that would help improve these results? [scatter plot images not included]

@nreimers