UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
14.83k stars 2.44k forks

Comparing relatedness to key sentences #590

Open 9j7axvsLuF opened 3 years ago

9j7axvsLuF commented 3 years ago

Hi,

I've been using sentence-transformers for a while, and I really love it - thanks a lot for your work!

I have a question about the best way to compare the semantic relatedness of a bunch of documents to two key sentences. Essentially, I want to determine whether there is a significant correlation in my corpus between semantic relatedness to sentence A and semantic relatedness to sentence B.

What I'm doing at the moment is the following:

The reason why I take the average similarity of the top k sentences to each key sentence is that there is quite a bit of semantic diversity within each doc, so a lot of information would be lost by simply averaging the similarities of all sentences in each doc to each key sentence.
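For concreteness, a minimal sketch of that per-document top-k scoring, with plain NumPy arrays standing in for `model.encode(...)` output (the array shapes and function names here are my own illustration, not from the thread):

```python
import numpy as np

def cos_sim(key_emb, sent_embs):
    # Cosine similarity between one key-sentence vector and each row of sent_embs.
    key = key_emb / np.linalg.norm(key_emb)
    sents = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    return sents @ key

def doc_score(sent_embs, key_emb, k):
    # Score a document by the mean similarity of its k best-matching sentences,
    # so one off-topic stretch of the doc doesn't drown out the relevant part.
    sims = cos_sim(key_emb, sent_embs)
    return float(np.sort(sims)[-k:].mean())

# Toy stand-ins for the sentence embeddings of one document and one key sentence.
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(10, 16))   # 10 sentences, 16-dim embeddings
key_emb = rng.normal(size=16)
score = doc_score(doc_embs, key_emb, k=3)
```

With k equal to the number of sentences this reduces to the plain average over the whole doc; smaller k keeps only the best-matching part.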

However, what I find is that no matter what key sentences I use as A and B, I get at least a modest correlation on the plot (using linear regression). Is this expected? Am I just picking up noise?

Another thing: the higher the k value, the stronger the correlation, no matter which key sentences I'm looking at.

Any suggestions/advice would be most appreciated!

nreimers commented 3 years ago

Hi @9j7axvsLuF What is the cosine similarity between sentences A and B? If it is > 0, then when a doc scores higher with sentence A, one would expect it to also score higher with sentence B.

Note that the vector space is not perfect. Even when sentences A and B do not really share anything in common, the cosine similarity between them can be > 0.

Further, maybe the correlation is due to different doc lengths? The longer the document, the higher the chance that it contains a sentence that scores high with sentence A and a sentence that scores high with sentence B.
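That length effect shows up even with pure noise. A quick simulation (my illustration, not from the thread): with completely random unit vectors as "sentence embeddings", the top-k mean for 50-sentence "documents" is reliably higher than for 5-sentence ones, with no semantics involved:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, K = 16, 3

def topk_score(n_sentences):
    # Random "sentence embeddings": even with no signal at all, more
    # sentences means more chances to hit a high similarity by luck.
    key = rng.normal(size=DIM)
    key /= np.linalg.norm(key)
    sents = rng.normal(size=(n_sentences, DIM))
    sents /= np.linalg.norm(sents, axis=1, keepdims=True)
    sims = sents @ key
    return np.sort(sims)[-K:].mean()

short_docs = np.mean([topk_score(5) for _ in range(500)])
long_docs = np.mean([topk_score(50) for _ in range(500)])
# Longer "documents" score higher on average, purely because of their length.
```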

9j7axvsLuF commented 3 years ago

Hi @nreimers

Many thanks for your quick reply.

Your comment about doc length makes a lot of sense. I wonder how I could mitigate that. I could compare paragraphs instead of documents, but ultimately that wouldn't be as useful.

Regarding the issue of the two target sentences having a cossim > 0, that also makes a lot of sense - I'll have to double-check on a case-by-case basis. Do you have a suggestion to offset this? Maybe using a metric other than cosine similarity to compare the relatedness of documents to sentence embeddings?
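One generic statistical check on both confounds (just an illustration of the technique, not something proposed in the thread): correlate the two score sets after regressing out a suspected common driver, such as document length, from both. The names below are hypothetical:

```python
import numpy as np

def partial_corr(x, y, z):
    # Correlation between x and y after removing, by least squares,
    # the part of each that is linearly explained by the confounder z.
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy demo: both score sets are driven by doc length alone, so the raw
# correlation is high but the partial correlation is near zero.
rng = np.random.default_rng(2)
doc_len = np.linspace(0.0, 1.0, 200)
scores_a = doc_len + 0.05 * rng.normal(size=200)
scores_b = doc_len + 0.05 * rng.normal(size=200)
raw = float(np.corrcoef(scores_a, scores_b)[0, 1])
partial = partial_corr(scores_a, scores_b, doc_len)
```

If the correlation between the A-scores and B-scores survives this kind of control, it is less likely to be an artifact of length or of the baseline similarity between A and B.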

Finally, which model would be best suited for this task in your opinion? (I'm using roberta-large-nli-stsb-mean-tokens because it has the best score on the STS benchmark).

Thanks in advance!