mrm1001 closed this issue 7 months ago
first draft of the proposal is here
Thanks so much @davidsbatista! Great appraisal!
The metrics we will be implementing in Haystack 2.1.0 are here, and they are basically:
I was wondering whether you could write one final recommendation on what you think the evaluation metrics we're implementing in Haystack should return. I'm making this up, but something like:
Aggregate: {SAS: {mean: 0.9}, context_relevance: {mean: 0.75}, recall_single: {mean: 0.5}, recall_multi: {mean: 0.6}, faithfulness: {mean: 0.9}}
Single: (query_1, answer_1): {SAS: 0.8, recall_single: 1, recall_multi: 0.8, context_relevance: 0.9, faithfulness: 0.8}
And then the user can aggregate across queries if they want something different from the "mean".
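A minimal sketch of what that shape could look like in practice (the field and metric names here are illustrative, not the actual Haystack API), showing how a user could compute their own aggregate from the per-query scores when the default "mean" isn't what they want:

```python
from statistics import mean, median

# Hypothetical per-query scores, keyed by (query, answer) pair.
single = {
    ("query_1", "answer_1"): {"SAS": 0.8, "recall_single": 1.0, "recall_multi": 0.8,
                              "context_relevance": 0.9, "faithfulness": 0.8},
    ("query_2", "answer_2"): {"SAS": 0.9, "recall_single": 0.0, "recall_multi": 0.4,
                              "context_relevance": 0.6, "faithfulness": 1.0},
}

# Default aggregate: the mean of each metric across all queries.
metric_names = next(iter(single.values())).keys()
aggregate = {m: {"mean": mean(scores[m] for scores in single.values())}
             for m in metric_names}

# A user who wants something other than the mean can aggregate
# the per-query scores themselves, e.g. take the median of SAS.
sas_median = median(scores["SAS"] for scores in single.values())
```

The key design point is that the per-query ("Single") scores are returned alongside the aggregate, so any other aggregation is a one-liner for the user.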
Thanks @mrm1001 - I've updated the page
Thanks @davidsbatista , I consider this done, so feel free to close it.
@davidsbatista I suggest that you close the issue only once the proposal has been reviewed and added to the GitHub repo in this proposals folder: https://github.com/deepset-ai/haystack/tree/main/proposals
thanks for the suggestion @julian-risch - this is indeed a more structured way to present the proposal.
I've added it here: https://github.com/deepset-ai/haystack/pull/7462/files
I cut many of the ideas and tried to keep it simple, working with PoC code. We can then iterate and add other ideas.
User stories:
Examples in other libraries:
ragas
langsmith
More context here: https://www.notion.so/deepsetai/Evaluation-1521712b928d4142828232f2df136856?pvs=4