deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Retriever & Pipeline Evaluation: Different results from retriever.eval() vs. EvalRetriever node #983

Closed · sophgit closed this issue 3 years ago

sophgit commented 3 years ago

**Question** Hey :) I have been evaluating my pipeline with your updated evaluation tutorial in Colab. Great tutorial, thanks!! However, I noticed that the retriever recall I get when evaluating the retriever on its own differs from the retriever recall reported in the pipeline evaluation. Shouldn't the two values be the same? `top_k_retriever` was set to 5 in both cases. When I run the same evaluation with the older tutorial (with Finder instead of Pipeline), the retriever recall is identical in the standalone evaluation and the Finder evaluation.

**Additional context**

The result of evaluating the retriever on its own:

```python
retriever_eval_results = retriever.eval(top_k=5, label_index=label_index, doc_index=doc_index)
```

```
For 255 out of 272 questions (93.75%), the answer was in the top-5 candidate passages selected by the retriever.
Retriever Recall: 0.9375
Retriever Mean Avg Precision: 0.8526960784313726
```

The result of the pipeline evaluation:

```python
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=eval_retriever, name="EvalRetriever", inputs=["ESRetriever"])
p.add_node(component=reader, name="QAReader", inputs=["EvalRetriever"])
p.add_node(component=eval_reader, name="EvalReader", inputs=["QAReader"])

results = []
for q, l in q_to_l_dict.items():
    res = p.run(
        query=q,
        top_k_retriever=5,
        labels=l,
        top_k_reader=3,
        index=doc_index,
    )
    results.append(res)
```

```
Retriever
recall: 0.9265 (252 / 272)

Retriever (Speed)
No indexing performed via Retriever.run()
Queries Performed: 272
Query time: 3.9433569139969222s
0.014497635713223978 seconds per query
```

Is there a reason why the recall deteriorates in the pipeline? Thank you!

brandenchan commented 3 years ago

Hi @sophgit, cool to see you're already using the new evaluation nodes!

So I do expect a difference in Retriever eval stats between `Retriever.eval()` and an `EvalRetriever` node, but I actually would have expected `Retriever.eval()` to do worse. The reason is that `Retriever.eval()` performs a Closed Domain retrieval evaluation: a retrieval counts as successful only if the right document is retrieved, judged by the document's ID. The `EvalRetriever` node performs an Open Domain evaluation: a document is considered correctly retrieved as long as the answer string is contained within the retrieved document, and we don't check its ID.

Open Domain retrieval is easier than Closed Domain, so I actually would have expected the opposite of the results you posted. Down the line, we will want to implement Closed Domain evaluation in the Eval nodes as well.
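To make the two criteria concrete, here is a toy sketch (plain Python, not Haystack internals; the function and argument names are made up) of what counts as a correct retrieval under each mode:

```python
# Toy illustration of the two evaluation criteria described above
# (not Haystack code; names are made up for the example).

def closed_domain_hit(retrieved_doc_ids, labeled_doc_id):
    # Closed Domain: correct only if the labeled document itself was
    # retrieved, judged by its document ID.
    return labeled_doc_id in retrieved_doc_ids

def open_domain_hit(retrieved_doc_texts, answer_string):
    # Open Domain: correct if *any* retrieved document contains the answer
    # string, regardless of which document it came from.
    return any(answer_string in text for text in retrieved_doc_texts)
```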

For now, we should definitely look into this to see what is going wrong. It seems you are using a different dataset from the one in the tutorial, which makes this a little tricky. But to start, maybe you could help us out by looking a little more closely at the Retriever's predictions in each case.

For the Retriever-only case, could you try adding the `return_preds=True` argument? i.e.

```python
res = retriever.eval(return_preds=True)
```

`res` is now a dictionary, and `res["predictions"]` contains the predictions of the retriever.

For the Evaluation Nodes case, add `debug=True` when initializing the `EvalRetriever`, i.e.

```python
eval_retriever = EvalRetriever(debug=True)
```

After running the evaluation, `eval_retriever.log` will contain all the retriever's predictions.

If you are able to compare the results from each and see which samples differ, that could be really informative for us!
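As a rough starting point, the comparison could look something like the sketch below. The exact structure of the prediction entries isn't shown in this thread, so the dictionary keys (`"query"`, `"retrieved_doc_ids"`) are assumptions that will likely need adapting to the real objects.

```python
# Hypothetical comparison of the two prediction sets; the keys "query" and
# "retrieved_doc_ids" are assumptions, not the documented structure.

# Standalone retriever evaluation, with predictions returned.
res = retriever.eval(
    top_k=5, label_index=label_index, doc_index=doc_index, return_preds=True
)
standalone_preds = res["predictions"]

# Pipeline evaluation: eval_retriever was created with EvalRetriever(debug=True)
# and the evaluation loop has already been run, so its log is populated.
pipeline_preds = eval_retriever.log

# Index the standalone predictions by query and report queries whose
# retrieved documents differ between the two runs.
standalone_by_query = {p["query"]: p for p in standalone_preds}
for entry in pipeline_preds:
    query = entry.get("query")
    standalone = standalone_by_query.get(query)
    if standalone is None:
        continue
    if entry.get("retrieved_doc_ids") != standalone.get("retrieved_doc_ids"):
        print(f"Predictions differ for query: {query!r}")
```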

brandenchan commented 3 years ago

Also, do you know if there are any duplicate questions in your dataset? I am wondering if something similar to this (#933) is happening.

sophgit commented 3 years ago

Hi @brandenchan,

thank you for your detailed and fast reply! Now I understand the exact difference between open and closed domain evaluation. Yes, I am using a different dataset: manually annotated German data. I checked, and although some questions are similar, there are no duplicate questions in the dataset.

I think I found the issue: whitespace. When I compared the "False" results from `eval_retriever.log`, I noticed that 3 of them were marked as False although the correct document was included in the retrieved documents. Then I compared the answer labels and saw that all three answer strings contained a double whitespace, while the context in the log didn't. In my original file, both answer and context contain the double whitespace. I edited my file so that neither context nor answer contains any extra whitespace and fixed the answer_starts accordingly. Now evaluating the retriever on its own and in the pipeline both result in a recall of 93.75% 🎉 Is it possible that during evaluation the context is cleaned of redundant whitespace while the answer remains the same?
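A toy reproduction of that mismatch (with made-up strings) shows why the open-domain containment check fails once only one side has its whitespace normalized:

```python
# Made-up example: the stored document text has had its whitespace normalized,
# but the answer label still contains a double space.
context = "CRM stands for Customer Relationship Management and describes a strategy."
answer = "Customer  Relationship Management"  # note the double space

print(answer in context)                    # False -> counted as a retrieval miss
print(" ".join(answer.split()) in context)  # True once the answer is normalized too
```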

brandenchan commented 3 years ago

@sophgit Great work figuring that out! This is going to help us a lot in finding the exact source of the problem. It's definitely possible that our contexts are being processed differently from the answers.

I will look into this in the coming sprint and post my findings and progress in this issue.

brandenchan commented 3 years ago

Hi @sophgit, I had a chance to look into this issue again. I replicated your situation by creating a single sample where both the document and the answer contained a double whitespace. But in both closed and open domain eval, the retrieval was considered correct.

I have a couple of small hypotheses about what might be going wrong, but I am wondering whether it would be possible for you to send us this problematic sample so that we can replicate and then fix the issue?

brandenchan commented 3 years ago

As a side note, since you are working with German, I thought I'd bring this to your attention! We trained a German QA model and also a German Dense Passage Retrieval model using a hand-annotated dataset that we created. All of it is open-sourced now, so feel free to try them out if they might be helpful for your use case!

https://deepset.ai/germanquad

sophgit commented 3 years ago

Hi @brandenchan, thanks for looking into the problem. These are the three examples (text from https://onlinemarketing.de/lexikon/definition-customer-relationship-management-crm). I can't upload JSON, so I attached a txt file: example.txt. And thanks for the hint, I had already seen your post :) Can't wait to try the new models and dataset!

Timoeller commented 3 years ago

Hey @sophgit, I remember you also helping to fix an issue with the annotation tool, along with other interesting conversations.

Wow, so it seems you have created 272 domain-specific datapoints for German QA and already used them for evaluation. Could we actually have a call about this? I would be highly interested in interacting, and maybe also in seeing how a model trained on GermanQuAD or GermanDPR could improve your performance. Could you maybe contact me at timo.moeller --at-- deepset.ai?

brandenchan commented 3 years ago

Just as an update: it turns out that the step of splitting Documents into Passages can cause whitespace normalization if `PreProcessor.split_by == "word"`. That's why your Document text has no duplicate whitespaces, but your answers still do.

We will want to fix it (#1023) but it is at a very low level in the code and probably won't be a quick fix. For now, I have opened a PR to raise a warning (#1022). And one work around for this issue, if it comes up again in future, is to set PreProcessor.split_by == "passage" since this doesn't have the whitespace normalization side effect. Hope this clarifies things and thanks for helping us discover this pesky issue!