UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
14.83k stars 2.44k forks

Comparing relatedness to key sentences #590

Open 9j7axvsLuF opened 3 years ago

9j7axvsLuF commented 3 years ago

Hi,

I've been using sentence-transformers for a while, and I really love it - thanks a lot for your work!

I have a question about the best way to compare the semantic relatedness of a bunch of documents to two key sentences. Essentially, I want to determine whether there is a significant correlation in my corpus between semantic relatedness to sentence A and semantic relatedness to sentence B.

What I'm doing at the moment is the following:

The reason why I take the average similarity of the top k sentences to each key sentence is that there is quite a bit of semantic diversity within each doc, so a lot of information would be lost by simply averaging the similarities of all sentences in each doc to each key sentence.
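For concreteness, a minimal sketch of that per-document top-k scoring, with plain NumPy arrays standing in for `model.encode(...)` output (the array shapes and function names here are my own illustration, not from the thread):

```python
import numpy as np

def cos_sim(key_emb, sent_embs):
    # Cosine similarity between one key-sentence vector and each row of sent_embs.
    key = key_emb / np.linalg.norm(key_emb)
    sents = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    return sents @ key

def doc_score(sent_embs, key_emb, k):
    # Score a document by the mean similarity of its k best-matching sentences,
    # so one off-topic stretch of the doc doesn't drown out the relevant part.
    sims = cos_sim(key_emb, sent_embs)
    return float(np.sort(sims)[-k:].mean())

# Toy stand-ins for the sentence embeddings of one document and one key sentence.
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(10, 16))   # 10 sentences, 16-dim embeddings
key_emb = rng.normal(size=16)
score = doc_score(doc_embs, key_emb, k=3)
```

With k equal to the number of sentences this reduces to the plain average over the whole doc; smaller k keeps only the best-matching part.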

However, what I find is that no matter what key sentences I use as A and B, I get at least a modest correlation on the plot (using linear regression). Is this expected? Am I just picking up noise?

Another thing: the higher the k value, the stronger the correlation, no matter which key sentences I'm looking at.

Any suggestions/advice would be most appreciated!

nreimers commented 3 years ago

Hi @9j7axvsLuF What is the cosine similarity between sentences A and B? If it is > 0, then when a doc scores higher with sentence A, one would expect it to also score higher with sentence B.

Note that the vector space is not perfect. Even when sentences A and B do not really share anything in common, the cosine similarity between them can be > 0.

Further, maybe the correlation is due to different doc lengths? The longer the document, the higher the chance that it contains a sentence that scores high with sentence A and a sentence that scores high with sentence B.
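That length effect shows up even with pure noise. A quick simulation (my illustration, not from the thread): with completely random unit vectors as "sentence embeddings", the top-k mean for 50-sentence "documents" is reliably higher than for 5-sentence ones, with no semantics involved:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, K = 16, 3

def topk_score(n_sentences):
    # Random "sentence embeddings": even with no signal at all, more
    # sentences means more chances to hit a high similarity by luck.
    key = rng.normal(size=DIM)
    key /= np.linalg.norm(key)
    sents = rng.normal(size=(n_sentences, DIM))
    sents /= np.linalg.norm(sents, axis=1, keepdims=True)
    sims = sents @ key
    return np.sort(sims)[-K:].mean()

short_docs = np.mean([topk_score(5) for _ in range(500)])
long_docs = np.mean([topk_score(50) for _ in range(500)])
# Longer "documents" score higher on average, purely because of their length.
```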

9j7axvsLuF commented 3 years ago

Hi @nreimers

Many thanks for your quick reply.

Your comment about doc length makes a lot of sense. I wonder how I could mitigate that. I could compare paragraphs instead of documents, but ultimately that wouldn't be as useful.

Regarding the issue of the two target sentences having a cossim > 0, that also makes a lot of sense - I'll have to double-check on a case-by-case basis. Do you have a suggestion to offset this? Maybe using a metric other than cosine similarity to compare the relatedness of documents to sentence embeddings?
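One generic statistical check on both confounds (just an illustration of the technique, not something proposed in the thread): correlate the two score sets after regressing out a suspected common driver, such as document length, from both. The names below are hypothetical:

```python
import numpy as np

def partial_corr(x, y, z):
    # Correlation between x and y after removing, by least squares,
    # the part of each that is linearly explained by the confounder z.
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy demo: both score sets are driven by doc length alone, so the raw
# correlation is high but the partial correlation is near zero.
rng = np.random.default_rng(2)
doc_len = np.linspace(0.0, 1.0, 200)
scores_a = doc_len + 0.05 * rng.normal(size=200)
scores_b = doc_len + 0.05 * rng.normal(size=200)
raw = float(np.corrcoef(scores_a, scores_b)[0, 1])
partial = partial_corr(scores_a, scores_b, doc_len)
```

If the correlation between the A-scores and B-scores survives this kind of control, it is less likely to be an artifact of length or of the baseline similarity between A and B.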

Finally, which model would be best suited for this task in your opinion? (I'm using roberta-large-nli-stsb-mean-tokens because it has the best score on the STS benchmark).

Thanks in advance!