mrm1001 closed this issue 7 months ago
first draft of the proposal is here
Thanks so much @davidsbatista! Great appraisal!
The metrics we will be implementing in Haystack 2.1.0 are here, and they are basically:
I was wondering whether you could write one final recommendation on what you think the evaluation metrics we're implementing in Haystack should return. I'm making this up, but something like:
Aggregate: {SAS: {mean: 0.9}, context_relevance: {mean: 0.75}, recall_single: {mean: 0.5}, recall_multi: {mean: 0.6}, faithfulness: {mean: 0.9}}
Single: (query_1, answer_1): {SAS: 0.8, recall_single: 1, recall_multi: 0.8, context_relevance: 0.9, faithfulness: 0.8}
And then the user can aggregate across queries if they want something different from the "mean".
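A minimal sketch of what that shape could look like in practice (the field and metric names here are illustrative, not the actual Haystack API), showing how a user could compute their own aggregate from the per-query scores when the default "mean" isn't what they want:

```python
from statistics import mean, median

# Hypothetical per-query scores, keyed by (query, answer) pair.
single = {
    ("query_1", "answer_1"): {"SAS": 0.8, "recall_single": 1.0, "recall_multi": 0.8,
                              "context_relevance": 0.9, "faithfulness": 0.8},
    ("query_2", "answer_2"): {"SAS": 0.9, "recall_single": 0.0, "recall_multi": 0.4,
                              "context_relevance": 0.6, "faithfulness": 1.0},
}

# Default aggregate: the mean of each metric across all queries.
metric_names = next(iter(single.values())).keys()
aggregate = {m: {"mean": mean(scores[m] for scores in single.values())}
             for m in metric_names}

# A user who wants something other than the mean can aggregate
# the per-query scores themselves, e.g. take the median of SAS.
sas_median = median(scores["SAS"] for scores in single.values())
```

The key design point is that the per-query ("Single") scores are returned alongside the aggregate, so any other aggregation is a one-liner for the user.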
Thanks @mrm1001 - I've updated the page
Thanks @davidsbatista , I consider this done, so feel free to close it.
@davidsbatista I suggest that you close the issue only once the proposal has been reviewed and added to the GitHub repo in this proposals folder: https://github.com/deepset-ai/haystack/tree/main/proposals
thanks for the suggestion @julian-risch - this is indeed a more structured way to present the proposal.
I've added it here: https://github.com/deepset-ai/haystack/pull/7462/files
I cut many of the ideas and tried to keep it simple, working with PoC code. We can then iterate and add other ideas.
User stories:
Examples in other libraries:
ragas
langsmith
More context here: https://www.notion.so/deepsetai/Evaluation-1521712b928d4142828232f2df136856?pvs=4