deepset-ai / haystack

:mag: LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Handle errors separately in evaluators and the run results #7973

Open mrm1001 opened 1 month ago

mrm1001 commented 1 month ago

Context

When running the evaluators over larger datasets, it is very common (depending on the model) to run into LLM errors where the output is not valid JSON. For example, while running the benchmark scripts over the ARAGOG dataset, I always get at least one row with invalid JSON, so every run of the script produces a score report that is not very useful, such as:

metrics             score
context_relevance   NaN

Whenever there is an error, the output of the LLM-based evaluation metric looks something like:

{'statements': [], 'statement_scores': [], 'score': nan}

As a user, I would like to keep track of the errors that happened during evaluation, so ideally this should be returned as a flag, for example:

{'statements': [], 'statement_scores': [], 'score': nan, 'error': True}
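A minimal sketch of how such a flag could be set, assuming a hypothetical parse_llm_reply helper around the JSON-parsing step (names are illustrative, not the evaluator's actual internals):

```python
import json
import math


def parse_llm_reply(reply: str) -> dict:
    """Hypothetical helper (not Haystack's actual code): parse the evaluator
    LLM reply and flag invalid JSON instead of silently yielding only NaN."""
    try:
        parsed = json.loads(reply)
        return {
            "statements": parsed["statements"],
            "statement_scores": parsed["statement_scores"],
            "score": parsed["score"],
            "error": False,
        }
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed or incomplete JSON from the model: keep NaN but surface the failure.
        return {"statements": [], "statement_scores": [], "score": math.nan, "error": True}
```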

Then, in the evaluation score report, we could return the mean of the scores while ignoring the errors, and report the number of errors separately:

metrics             score   total_errors
context_relevance   0.9     1
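A rough sketch of how the report could aggregate per-row outputs under this scheme, in plain Python (not the framework's actual report code; the function name is illustrative):

```python
import math


def aggregate_scores(rows: list[dict]) -> dict:
    """Aggregate per-row evaluator outputs shaped like the dicts above."""
    total_errors = sum(1 for r in rows if r.get("error") or math.isnan(r["score"]))
    valid = [r["score"] for r in rows if not math.isnan(r["score"])]
    mean = sum(valid) / len(valid) if valid else math.nan
    return {"score": mean, "total_errors": total_errors}


# Example: a failed row is excluded from the mean but still counted.
rows = [
    {"score": 0.9, "error": False},
    {"score": math.nan, "error": True},
]
print(aggregate_scores(rows))  # {'score': 0.9, 'total_errors': 1}
```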

Outcome

mrm1001 commented 1 month ago

There is a workaround for this issue: users can take the individual_scores output of the evaluators directly and perform custom aggregation.
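For example, assuming the evaluator output exposes an individual_scores list with NaN entries for failed rows (as described above), a custom aggregation could look like this sketch:

```python
import math

# Sketch only: "eval_output" stands in for the dict returned by an
# LLM-based evaluator; only the individual_scores key is shown here.
eval_output = {"individual_scores": [0.9, 0.8, float("nan"), 1.0]}

scores = eval_output["individual_scores"]
valid = [s for s in scores if not math.isnan(s)]
total_errors = len(scores) - len(valid)
mean_score = sum(valid) / len(valid) if valid else math.nan

print(f"score={mean_score:.2f}, total_errors={total_errors}")  # score=0.90, total_errors=1
```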