Hey @timpal0l,
the default similarity function of ElasticsearchDocumentStore is dot_product. As the dot product can take any positive or negative value, we scale it with expit to a value between 0 and 1. Note that this does not guarantee that a perfect match receives a value of 1, because the value depends on the magnitudes of the vectors themselves and not only on their angle. To get a score of 1 for a perfect match, you can set the similarity function to cosine within ElasticsearchDocumentStore.
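For instance, something along these lines should do it (the host, index, and embedding settings below are placeholders for your own setup, and the exact import path depends on your Haystack version):

```python
# Sketch: switching the FAQ setup to cosine similarity.
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(
    host="localhost",            # placeholder: point this at your Elastic Cloud instance
    index="faq",                 # placeholder index name
    embedding_field="question_emb",
    embedding_dim=384,           # all-MiniLM-L6-v2 produces 384-dimensional embeddings
    similarity="cosine",         # scores become cosine similarity instead of scaled dot product
)
```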
To make sure I understand the second question correctly: do you mean the context of the answer rather than the content?
I see, but even if we use plain dot_product, the query = "What is a novel coronavirus?" and the content = "What is a novel coronavirus?" have exactly the same embeddings, so their dot_product should be 1.0 as well. Do you have any clue why the dot product between the same texts gives 0.6059359910992558?
The second question was probably just me mixing up context and content... 😆
If you follow the FAQ tutorial and query with text that exactly matches a document's content, do you get 1.0?
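Something like this sketch could be used to check (the import path and retriever setup are assumptions based on the FAQ tutorial and may differ in your Haystack version):

```python
from haystack.nodes import EmbeddingRetriever

# Reuse the document store from the tutorial (placeholder variable name).
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
)

# Query with text that is identical to an indexed document's content.
docs = retriever.retrieve(query="What is a novel coronavirus?", top_k=1)
print(docs[0].content, docs[0].score)
```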
To answer your first question: the raw dot product here produces a value of around 40 (I saw that earlier while debugging). As mentioned before, the value coming from the dot product can be anything from a very large negative number to a very large positive number, depending on the magnitudes of the vectors (they are not normalized before calculating the dot product). Afterwards we scale this value to something between 0 and 1 by dividing it by 100 and applying the expit function. This scaling, however, does not have the property of producing 1 for a perfect match. It's merely a heuristic to avoid producing insanely big numbers.
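For illustration, this is roughly what that scaling looks like (the raw value of 43 is just an example chosen to land near the score you reported):

```python
# Divide the raw dot product by 100, then squash it with the logistic sigmoid (expit).
# A raw value around 40 therefore ends up near 0.6 rather than 1.0, even for identical texts.
from scipy.special import expit

raw_score = 43.0                 # example unnormalized dot product
scaled = expit(raw_score / 100)  # ~0.606
print(scaled)
```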
I see, but why use this option as the default? The number 0.6 isn't that intuitive for a perfect match. I think it's better practice to use cosine similarity normalized to the [0, 1] range for interpretability. If I query with an empty string I get roughly 0.5, which means there is only a very small range in which I would need to find a good threshold for an application.
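For example, something like this is what I have in mind (the embedding vector is a placeholder, and the (x + 1) / 2 rescaling is just one common way to map cosine values into [0, 1]):

```python
# With L2-normalized embeddings, the dot product equals the cosine similarity,
# so identical texts score exactly 1.0.
import numpy as np

def cosine(a, b):
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

emb = np.array([0.3, -1.2, 0.7])      # placeholder embedding
print(cosine(emb, emb))               # 1.0 for identical vectors
print((cosine(emb, emb) + 1) / 2)     # still 1.0 after rescaling to [0, 1]
```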
I totally get your point. In your case (finding a good threshold), interpretability of the score is the leading factor. However (according to our docstrings), dot product simply performs better in some settings, such as Dense Passage Retrieval, where interpretability is not a major concern; most of the time we just want a ranking. So we had to make a decision here at the API level.
At least in my opinion, it isn't too intuitive why these two metrics would produce a different order/ranking; this article shows a very illustrative example.
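A toy example of the effect (not taken from the article) could look like this:

```python
# Document A has a large magnitude but is 45 degrees off the query;
# document B points in exactly the same direction but has a small magnitude.
import numpy as np

query = np.array([1.0, 1.0])
doc_a = np.array([10.0, 0.0])
doc_b = np.array([0.9, 0.9])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(query @ doc_a, query @ doc_b)                 # 10.0 vs 1.8  -> dot product ranks A first
print(cosine(query, doc_a), cosine(query, doc_b))   # ~0.71 vs 1.0 -> cosine ranks B first
```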
On the other hand, I see no reason not to set similarity="cosine" in the tutorial, together with a simple description of what the cosine values mean.
@timpal0l Would you like to make that contribution in a PR?
Hello @timpal0l, do you think we can close this issue for now? Or are you interested in contributing a PR?
I'm closing for now. If you later want to pick up this issue, feel free to re-open :slightly_smiling_face:
Question
I followed the FAQ tutorial to perform semantic search on a dataset with questions and predefined answers. The whole pipeline works: it embeds the documents with sentence-transformers/all-MiniLM-L6-v2, updates the document store (which in my case is hosted on Elastic Cloud), and performs the search. However, the score attribute is never above roughly 0.60. Is this not just vanilla cosine similarity? In the dataframe you have the following sample in the first row. Even if I search with the exact query "What is a novel coronavirus?", the score is 0.6059359910992558. Should the score attribute not be 1.0?