deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Low "score" for semantic search with exact string #1892

Closed timpal0l closed 2 years ago

timpal0l commented 2 years ago

Question I followed the FAQ tutorial to perform semantic search on a dataset with questions and predefined answers.

The whole pipeline works: it embeds the documents with sentence-transformers/all-MiniLM-L6-v2, updates the document store (hosted on Elastic Cloud in my case), and performs the search.

However, the score attribute is never above roughly 0.60. Is this not just vanilla cosine similarity?

The first row of the dataframe contains the following sample:

content: What is a novel coronavirus?
answer:  A novel coronavirus is a new coronavirus that has not been previously identi...

Even if I search with the exact query "What is a novel coronavirus?", the score is 0.6059359910992558.

Answers:
[   {   'answer': 'A novel coronavirus is a new coronavirus that has not been '
                  'previously identified. The virus causing coronavirus '
                  'disease 2019 (COVID-19), is not the same as the '
                  'coronaviruses that commonly circulate among humans and '
                  'cause mild illness, like the common cold.\n'
                  '\n'
                  'A diagnosis with coronavirus 229E, NL63, OC43, or HKU1 is '
                  'not the same as a COVID-19 diagnosis. Patients with '
                  'COVID-19 will be evaluated and cared for differently than '
                  'patients with common coronavirus diagnosis.',
        'context': 'A novel coronavirus is a new coronavirus that has not been '
                   'previously identified. The virus causing coronavirus '
                   'disease 2019 (COVID-19), is not the same as the '
                   'coronaviruses that commonly circulate among humans and '
                   'cause mild illness, like the common cold.\n'
                   '\n'
                   'A diagnosis with coronavirus 229E, NL63, OC43, or HKU1 is '
                   'not the same as a COVID-19 diagnosis. Patients with '
                   'COVID-19 will be evaluated and cared for differently than '
                   'patients with common coronavirus diagnosis.',
        'score': 0.6059359910992558}, ...

Shouldn't the score attribute be 1.0?

Notebook

from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import FAQPipeline
from haystack.utils import print_answers
import pandas as pd
import requests

document_store = ElasticsearchDocumentStore(host="123456.europe-north1.gcp.elastic-cloud.com", 
                                            scheme='https',
                                            port=9243, 
                                            index="document",
                                            username="username", 
                                            password="password", 
                                            embedding_field="question_emb",
                                            embedding_dim=384,
                                            excluded_meta_data=["question_emb"])

retriever = EmbeddingRetriever(document_store=document_store, 
                               embedding_model="sentence-transformers/all-MiniLM-L6-v2", 
                               use_gpu=False)

temp = requests.get("https://raw.githubusercontent.com/deepset-ai/COVID-QA/master/data/faqs/faq_covidbert.csv")
with open("small_faq_covid.csv", "wb") as f:
    f.write(temp.content)

# Get dataframe with columns "question", "answer" and some custom metadata
df = pd.read_csv("small_faq_covid.csv")
# Minimal cleaning
df.fillna(value="", inplace=True)
df["question"] = df["question"].apply(lambda x: x.strip())
print(df.head())

# Get embeddings for our questions from the FAQs
questions = list(df["question"].values)
df["question_emb"] = retriever.embed_queries(texts=questions)
df = df.rename(columns={"question": "content"})

# Convert Dataframe to list of dicts and index them in our DocumentStore
docs_to_index = df.to_dict(orient="records")
document_store.write_documents(docs_to_index)

pipe = FAQPipeline(retriever=retriever)
prediction = pipe.run(query="What is a novel coronavirus?", params={"Retriever": {"top_k": 10}})
print_answers(prediction, details="medium")
document_store.get_all_documents()

[<Document: {'content': 'What is a novel coronavirus?', 'content_type': 'text', 'score': None, 'meta': {'answer': 'A novel coronavirus is a new coronavirus that has not been previously identified. The virus causing coronavirus disease 2019 (COVID-19), is not the same as the coronaviruses that commonly circulate among humans and cause mild illness, like the common cold.\n\nA diagnosis with coronavirus 229E, NL63, OC43, or HKU1 is not the same as a COVID-19 diagnosis. Patients with COVID-19 will be evaluated and cared for differently than patients with common coronavirus diagnosis.', 'answer_html': '<p>A novel coronavirus is a new coronavirus that has not been previously identified. The virus causing coronavirus disease 2019 (COVID-19), is not the same as the <a href="/coronavirus/types.html">coronaviruses that commonly circulate among humans</a>&nbsp;and cause mild illness, like the common cold.</p>\n<p>A diagnosis with coronavirus 229E, NL63, OC43, or HKU1 is not the same as a COVID-19 diagnosis. Patients with COVID-19 will be evaluated and cared for differently than patients with common coronavirus diagnosis.', 'link': '\nhttps://www.cdc.gov/coronavirus/2019-ncov/faq.html', 'source': 'Center for Disease Control and Prevention (CDC)', 'category': 'Coronavirus Disease 2019 Basics', 'country': 'USA', 'region': '', 'city': '', 'lang': 'en', 'last_update': '2020/03/17', 'name': 'Frequently Asked Questions'}, 'embedding': None, 'id': '5307f21453316d3c6add35421075692d'}>
  1. Why isn't the score attribute 1.0?
  2. Why are the answer and content fields the same in the prediction output? They are clearly not the same according to document_store.get_all_documents().

FAQ Check

tstadel commented 2 years ago

Hey @timpal0l, the default similarity function of ElasticsearchDocumentStore is dot_product. Since the dot product can take any positive or negative value, we scale it with expit to a value between 0 and 1. Note that this does not guarantee that a perfect match receives a value of 1, because the value depends on the magnitudes of the vectors themselves, not only on their angle. To achieve that, you can set the similarity function to cosine within ElasticsearchDocumentStore.
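To illustrate the difference described above, here is a minimal stdlib-only sketch (the vector and the divide-by-100 step are assumptions based on this thread, not Haystack's exact code path): cosine similarity of a vector with itself is always exactly 1, while the expit-scaled dot product depends on the vector's magnitude.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Angle-only similarity: identical vectors always score exactly 1.0
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def expit(x):
    # Logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

v = [2.0, 3.0, 6.0]  # hypothetical unnormalized embedding

print(cosine(v, v))            # 1.0 -- exact match, regardless of magnitude
print(expit(dot(v, v) / 100))  # |v|^2 = 49 -> expit(0.49) ≈ 0.62, not 1.0
```

So even a query identical to a stored document will not score 1.0 under the scaled dot product, because the score is driven by |v|^2 rather than by the angle between the vectors.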

To make sure I understand the second question correctly: do you mean the context of the answer, rather than the content?

timpal0l commented 2 years ago

I see, but even if we use the plain dot_product, the query "What is a novel coronavirus?" and the content "What is a novel coronavirus?" have exactly the same embeddings, so their dot product should be 1.0 as well. Do you have any idea why the dot product between identical texts yields 0.6059359910992558?

The second question was probably just me mixing up context and content... 😆

If you follow the FAQ tutorial and query with something that exactly matches a document's content, do you get 1.0?

tstadel commented 2 years ago

To answer your first question: the raw dot product here produces a value of around 40 (I saw that earlier while debugging). As mentioned before, this value can be anything from a very large negative number to a very large positive number, depending on the magnitudes of the vectors (they are not normalized before the dot product is calculated). Afterwards, we scale this value to something between 0 and 1 by dividing by 100 and applying the expit function. This scaling, however, does not have the property of producing 1 for a perfect match. It is merely a heuristic to avoid producing insanely large numbers.

timpal0l commented 2 years ago

I see, but why use this option as the default? The number 0.6 isn't very intuitive for a perfect match. I think it's better practice to use cosine similarity normalized to the [0, 1] range for interpretability. If I query with an empty string I get roughly 0.5, which means there is only a very narrow range in which I would need to find a good threshold for an application.
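The [0, 1] normalization suggested here is usually the affine map (x + 1) / 2 (whether Haystack applies exactly this mapping for cosine is an assumption); a minimal sketch:

```python
def scaled(cos_sim):
    """Map cosine similarity from [-1, 1] onto [0, 1]."""
    return (cos_sim + 1) / 2

print(scaled(1.0))   # identical vectors   -> 1.0
print(scaled(0.0))   # orthogonal vectors  -> 0.5
print(scaled(-1.0))  # opposite vectors    -> 0.0
```

Under this mapping an unrelated (roughly orthogonal) query lands near 0.5, which matches the ~0.5 score observed for an empty-string query, and a perfect match lands at exactly 1.0.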

tstadel commented 2 years ago

I totally get your point. In your case (finding a good threshold), interpretability of the score is the leading factor. However, according to our docstrings, the dot product simply performs better in some settings, such as Dense Passage Retrieval, where interpretability is not a particular concern; most of the time we just want an ordering. So we had to make a decision at the API level.

At least in my opinion, it isn't immediately intuitive why these two metrics can produce a different order/ranking; this article shows a very illustrative example.

On the other hand, I see no reason not to set similarity="cosine" in the tutorial, along with a short description of what the cosine values mean. @timpal0l, would you like to contribute that in a PR?
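For reference, a minimal sketch of that change against the notebook earlier in this thread (same placeholder host and credentials as in the original snippet; assuming the Haystack v1 API used there, where ElasticsearchDocumentStore accepts a `similarity` argument):

```python
from haystack.document_stores import ElasticsearchDocumentStore

# Same connection settings as in the notebook above; the only change is
# similarity="cosine" (the default is "dot_product"), which makes an exact
# embedding match score 1.0 after scaling to [0, 1].
document_store = ElasticsearchDocumentStore(
    host="123456.europe-north1.gcp.elastic-cloud.com",
    scheme="https",
    port=9243,
    index="document",
    username="username",
    password="password",
    embedding_field="question_emb",
    embedding_dim=384,
    excluded_meta_data=["question_emb"],
    similarity="cosine",
)
```

Note that the similarity function is fixed per index, so documents may need to be re-indexed (or a fresh index used) after changing it.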

ZanSara commented 2 years ago

Hello @timpal0l, do you think we can close this issue for now? Or are you interested in contributing a PR?

ZanSara commented 2 years ago

I'm closing for now. If you later want to pick up this issue, feel free to re-open :slightly_smiling_face: