deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Reader evaluation merges questions with the same text #933

Closed: lewtun closed this issue 3 years ago

lewtun commented 3 years ago

Describe the bug
Some QA datasets are review-based (e.g. AmazonQA and SubjQA), where the question-answer pairs concern reviews of a single product. In these datasets the same question text can appear for different products, e.g. in SubjQA the question "What do you think about headphone?" is asked about two different products.

This duplication of question text produces inflated EM/F1 scores when using FARMReader.eval for evaluation, because the question text is used as the aggregation key here, so the number of potential gold answers a prediction can match against increases.
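To make the effect concrete, here is a minimal sketch of the aggregation issue. The labels below are invented, and the dictionary grouping is only a stand-in for what the evaluation code does with its question-keyed aggregation:

```python
from collections import defaultdict

# Invented labels: (question_id, question_text, product_id, gold_answer)
labels = [
    ("q1", "What do you think about headphone?", "prod_A", "great sound"),
    ("q2", "What do you think about headphone?", "prod_B", "poor battery life"),
]

# Keying by question text collapses the two distinct questions into one entry
# that now carries gold answers from both products, so a prediction for either
# product can also match the other product's answer.
by_text = defaultdict(list)
by_id = defaultdict(list)
for qid, text, product, answer in labels:
    by_text[text].append(answer)
    by_id[qid].append(answer)

print(len(by_text))  # 1 -- both products merged under the same question string
print(len(by_id))    # 2 -- keying by question id keeps them separate
```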

For reasons I don't fully understand, a similar effect is seen with FARMReader.eval_on_file. For example, evaluating deepset/minilm-uncased-squad2 on the test set of the electronics domain of SubjQA, I get:

{'EM': 0.25139664804469275,
 'f1': 0.32950102625485156,
 'top_n_accuracy': 0.7597765363128491}

whereas the scores reported in the SubjQA publication are closer to EM = 0.06 and F1 = 0.23 (Figs. 5 & 7).
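For reference, the numbers above come from a call roughly like the following; the data directory and file name are placeholders for the SubjQA electronics test split, and the import path matches the Haystack ~0.x layout:

```python
from haystack.reader.farm import FARMReader  # import path in Haystack ~0.x

reader = FARMReader(model_name_or_path="deepset/minilm-uncased-squad2")

# Evaluate directly on a SQuAD-format file (paths are placeholders).
results = reader.eval_on_file(
    data_dir="data/subjqa/electronics",
    test_filename="test.json",
    device="cuda",
)
print(results)  # {'EM': ..., 'f1': ..., 'top_n_accuracy': ...}
```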

Error message
None (silent bug)

Expected behavior

Additional context
Evaluating the Retriever-Reader pipeline on review-based datasets is not currently supported "out of the box" by #904, because one needs to first filter the Retriever by product ID, compute the EM/F1 scores per product, and then aggregate them to obtain the overall scores for the pipeline.

I am not sure whether this use case is common enough to warrant a change to the API, but allowing users to evaluate e.g. the ExtractiveQAPipeline on a subset of the data (via a filter) would be welcome 😃
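For illustration, here is a rough sketch of the per-product evaluation I have in mind. The item_id filter field, the questions_by_product structure, and the normalize helper are assumptions made for the example, not part of the Haystack API; the pipeline call follows the ~0.x ExtractiveQAPipeline interface.

```python
from haystack.pipeline import ExtractiveQAPipeline  # import path in Haystack ~0.x


def normalize(text):
    """Lower-case and collapse whitespace so trivially different strings still match."""
    return " ".join((text or "").lower().split())


def evaluate_per_product(pipeline: ExtractiveQAPipeline, questions_by_product: dict) -> dict:
    """Exact-match score per product, restricting retrieval to that product's reviews.

    `questions_by_product` maps a product id to a list of (question, gold_answers)
    pairs. The `item_id` filter field is an assumption about how the review
    documents were indexed, not something Haystack itself defines.
    """
    per_product_em = {}
    for product_id, qa_pairs in questions_by_product.items():
        hits = 0
        for question, gold_answers in qa_pairs:
            prediction = pipeline.run(
                query=question,
                filters={"item_id": [product_id]},  # only retrieve reviews of this product
                top_k_retriever=10,
                top_k_reader=1,
            )
            answers = prediction.get("answers", [])
            predicted = answers[0]["answer"] if answers else ""
            hits += any(normalize(predicted) == normalize(gold) for gold in gold_answers)
        per_product_em[product_id] = hits / len(qa_pairs)
    # Overall pipeline score: average the per-product values (or weight by question count).
    return per_product_em
```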

To Reproduce
See this Colab notebook: https://colab.research.google.com/drive/1K00ut8_E4GuLS-xk4f_JRJdyRwGNATsO?usp=sharing

System:

lewtun commented 3 years ago

The merging of duplicate questions also appears to occur in BaseDocumentStore.add_eval_data, because the documents and labels extracted via eval_data_from_json here use the question string as the question attribute of each Label instance.

For reference, the test set of the electronics domain of SubjQA has 358 unique question IDs but only 226 unique question strings. Adding the eval data to the document store with add_eval_data indeed returns 226 labels.
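Those counts can be reproduced directly from the SQuAD-format file (the path below is a placeholder for the SubjQA electronics test split):

```python
import json

# Count unique question ids vs. unique question strings in a SQuAD-format file.
with open("data/subjqa/electronics/test.json") as f:  # placeholder path
    squad = json.load(f)

question_ids, question_texts = set(), set()
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            question_ids.add(qa["id"])
            question_texts.add(qa["question"])

print(len(question_ids))    # 358 unique question IDs
print(len(question_texts))  # 226 unique question strings
```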

brandenchan commented 3 years ago

This should be solved by #1030!

brandenchan commented 3 years ago

Hey @lewtun, we've solved this with #1119!