Closed lewtun closed 3 years ago
The merging of duplicate questions also appears to occur in BaseDocumentStore.add_eval_data because the documents and labels that are extracted via eval_data_from_json here use the question string as an attribute in each Label instance.
For reference, the test set of the electronics domain of SubjQA has 358 unique question IDs and 226 unique question strings. Adding the eval data to the document store with add_eval_data indeed returns 226 labels.
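The gap between the two counts can be checked without the document store. Here is a minimal sketch on toy data, assuming each example is a dict with hypothetical `id`, `product_id` and `question` fields (these names are for illustration only, not the actual SubjQA or Haystack schema):

```python
# Toy records; the field names are assumptions made for illustration.
examples = [
    {"id": "q1", "product_id": "A", "question": "What do you think about headphone?"},
    {"id": "q2", "product_id": "B", "question": "What do you think about headphone?"},
    {"id": "q3", "product_id": "A", "question": "How is the battery life?"},
]

unique_ids = {ex["id"] for ex in examples}
unique_questions = {ex["question"] for ex in examples}

# Keying on the question string alone collapses q1 and q2 into one entry.
labels_by_question = {}
for ex in examples:
    labels_by_question.setdefault(ex["question"], []).append(ex["id"])

print(len(unique_ids), len(unique_questions), len(labels_by_question))
```

Any store that keys labels on the question string will report the smaller count, which matches the 358 vs. 226 discrepancy described above.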
This should be solved by #1030!
Hey @lewtun, we've solved this with #1119!
Describe the bug
Some QA datasets are review-based (e.g. AmazonQA and SubjQA), where the question-answer pairs concern the reviews of a single product. In these datasets the same question text can appear for different products, e.g. in SubjQA the question "What do you think about headphone?" is asked about two different products.
This duplication of question text produces inflated EM/F1 scores when using FARMReader.eval for evaluation because the question text is used as a key for aggregation here, so the number of potential answers to match against increases. For reasons I don't fully understand, a similar effect is seen for FARMReader.eval_on_file, e.g. evaluating deepset/minilm-uncased-squad2 on the test set of the electronics domain of SubjQA I find:
whereas the scores shown in the publication are closer to EM = 0.06 and F1 = 0.23 (Figs. 5 & 7).
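To illustrate why pooling answers by question text can inflate exact match, here is a toy sketch in plain Python. It does not reproduce the FARMReader code path; the field names and the `exact_match` helper are hypothetical and only mimic the aggregation-by-question behaviour described above.

```python
from collections import defaultdict

# Toy gold labels and predictions; all field names are hypothetical.
gold = [
    {"product_id": "A", "question": "What do you think about headphone?", "answer": "great sound"},
    {"product_id": "B", "question": "What do you think about headphone?", "answer": "too tight"},
]
predictions = [
    {"product_id": "A", "question": "What do you think about headphone?", "answer": "too tight"},
    {"product_id": "B", "question": "What do you think about headphone?", "answer": "too tight"},
]

def exact_match(gold, predictions, key):
    """Fraction of predictions that equal any gold answer sharing the same key."""
    answers = defaultdict(set)
    for g in gold:
        answers[key(g)].add(g["answer"])
    hits = sum(p["answer"] in answers[key(p)] for p in predictions)
    return hits / len(predictions)

# Pooling by question text lets product A's prediction match product B's answer.
em_by_question = exact_match(gold, predictions, key=lambda x: x["question"])
# Keying on (product, question) keeps the answer pools separate.
em_by_product = exact_match(gold, predictions, key=lambda x: (x["product_id"], x["question"]))
print(em_by_question, em_by_product)
```

In this toy case the wrong prediction for product A counts as a hit when the key is the question text alone, because it matches the gold answer that belongs to product B.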
Error message
None (silent bug)
Expected behavior
FARMReader on review-based datasets.
Additional context
Evaluating the Retriever-Reader pipeline for review-based datasets is not currently supported "out-of-the-box" by #904 because one needs to first filter the Retriever for a product ID, calculate the EM/F1 scores per product, and then aggregate to get the overall scores for the pipeline.
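The filter, score and aggregate steps could look roughly like the sketch below. It is not a Haystack API: `predict` and `score` are hypothetical callables standing in for a product-filtered Retriever-Reader call and an EM/F1 metric, and weighting each product by its number of examples is one possible aggregation choice, assumed here for illustration.

```python
from collections import defaultdict

def evaluate_per_product(examples, predict, score):
    """Score each product separately, then average weighted by example count.

    `predict(product_id, question)` and `score(prediction, answer)` are
    hypothetical callables supplied by the caller.
    """
    by_product = defaultdict(list)
    for ex in examples:
        by_product[ex["product_id"]].append(ex)

    per_product = {}
    for product_id, items in by_product.items():
        scores = [score(predict(product_id, ex["question"]), ex["answer"]) for ex in items]
        per_product[product_id] = sum(scores) / len(scores)

    total = sum(len(items) for items in by_product.values())
    overall = sum(per_product[p] * len(by_product[p]) for p in by_product) / total
    return per_product, overall

# Toy usage with stubbed predictions (all data is made up).
examples = [
    {"product_id": "A", "question": "What do you think about headphone?", "answer": "great sound"},
    {"product_id": "B", "question": "What do you think about headphone?", "answer": "too tight"},
    {"product_id": "B", "question": "How is the battery life?", "answer": "lasts all day"},
]
stub = {
    ("A", "What do you think about headphone?"): "great sound",
    ("B", "What do you think about headphone?"): "great sound",
    ("B", "How is the battery life?"): "lasts all day",
}
per_product, overall = evaluate_per_product(
    examples,
    predict=lambda product_id, question: stub[(product_id, question)],
    score=lambda prediction, answer: float(prediction == answer),
)
print(per_product, overall)
```

Because predictions are only compared with gold answers for the same product, a duplicated question string can no longer widen the pool of acceptable answers.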
I am not sure whether the use case is common enough to warrant a change to the API, but allowing users to evaluate e.g. the ExtractiveQAPipeline on a subset of the data (via a filter) would be welcome 😃
To Reproduce
See this Colab notebook: https://colab.research.google.com/drive/1K00ut8_E4GuLS-xk4f_JRJdyRwGNATsO?usp=sharing
System: