Semantic Answer similarity evaluation

deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

Apache License 2.0

Common metrics for evaluating QA, such as Exact Match or F1 score, are very strict. They are best suited to settings where an exact entity (a name, date, or number) needs to be extracted. For more complex answers, we want a looser evaluation.
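To illustrate how strict these metrics are, here is a minimal sketch of SQuAD-style Exact Match and token-level F1. The helper names (`normalize`, `exact_match`, `f1_score`) and the normalization rules are assumptions for illustration, not the Haystack or FARM implementation.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

With these definitions, a prediction such as "about thirty percent" scores 0.0 on both metrics against the ground truth "30%", because the two answers share no tokens even though they mean roughly the same thing.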

Inspired by other domains like machine translation, we have started experimenting with a semantic textual similarity metric between the ground-truth answer and the predicted answer in https://github.com/deepset-ai/FARM/pull/803.
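One possible shape for such a metric, sketched under assumptions: embed both answers with some sentence encoder and compare the vectors by cosine similarity. The function names and the `encode` callable are hypothetical, and this is not necessarily how the linked PR computes the score.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_answer_similarity(
    prediction: str,
    truth: str,
    encode: Callable[[str], Vector],
) -> float:
    """Score a predicted answer against the ground truth.

    `encode` is a hypothetical hook: any function that maps a string to a
    fixed-size embedding, for example a wrapper around a sentence
    embedding model.
    """
    return cosine_similarity(encode(prediction), encode(truth))
```

Passing the encoder in as an argument keeps the sketch independent of any particular model. The quality of the score depends entirely on how well the chosen encoder captures answer meaning.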

We now want to bring this functionality to Haystack.

Prioritize pipeline eval.

Semantic Answer similarity evaluation #1241