deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Similarity-based Evaluation of QA Answers #1518

Closed · Timoeller closed this issue 3 years ago

Timoeller commented 3 years ago

Research options to evaluate answers based on semantic similarity.

Timoeller commented 3 years ago

We ran some experiments with pretty promising results:

Model

There is already a pretrained SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer'). I suspect training a joint model with our TextPairClassification would give only minor improvements. If this turns out to be used a lot, we could still train one, since in this scenario we do not gain much from the indexing capabilities anyway.

Eval on German

We used the GermanQuAD test set annotations and only took answer pairs whose answers come from completely different text positions. We computed embeddings with the SentenceTransformer and compared them via cosine similarity. We checked all 40 answer pairs with cosine similarity > 0.4; all texts seemed to semantically match.
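For reference, a minimal sketch of that procedure: embed both answers of a pair with the pretrained model and keep pairs above a cosine-similarity threshold. The answer pairs below are made up; only the model name and the 0.4 threshold come from the comment above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# hypothetical gold/alternative answer pairs taken from different text positions
pairs = [
    ("im Jahr 1871", "1871"),
    ("der Deutsche Bundestag", "das Parlament"),
]

gold_emb = model.encode([a for a, _ in pairs])
alt_emb = model.encode([b for _, b in pairs])

# cosine similarity of each answer pair; keep the ones above the 0.4 threshold
sims = util.cos_sim(gold_emb, alt_emb).diagonal().tolist()
for (a, b), sim in zip(pairs, sims):
    if sim > 0.4:
        print(f"match ({sim:.2f}): '{a}' <-> '{b}'")
```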

Eval on English

We applied the same model to the SQuAD dev set and checked 180 answer pairs above cosine similarity 0.7. All texts seemed to match.

Modelling errors

The model seems to be a bit sloppy when it comes to numbers and dates.

Detecting labelling errors

When looking at the other end of the spectrum, at dissimilar answers, we might have a tool here for improving annotations much more easily: pairs with similarities below 0.2 looked like candidates for re-annotation.
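A hypothetical sketch of how such low-similarity pairs could be surfaced for review; the example annotations are made up, and only the 0.2 threshold comes from the observation above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# hypothetical pairs of gold annotations for the same question
label_a = ["in the nineteenth century", "Albert Einstein"]
label_b = ["roughly 40 kilometres", "the city of Ulm"]

sims = util.cos_sim(model.encode(label_a), model.encode(label_b)).diagonal().tolist()

# very dissimilar gold answers are candidates for an annotation review
for a, b, sim in zip(label_a, label_b, sims):
    if sim < 0.2:
        print(f"possible labelling error ({sim:.2f}): '{a}' vs. '{b}'")
```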

Implementation

Looking at the code, I believe a registered metric might be the best fit, since we want to parameterize the evaluation with a model and a threshold for deciding whether a prediction is close enough to the given gold labels.
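As a rough sketch of what such a parameterized metric could look like (the function name and signature below are hypothetical, not Haystack's actual metric interface):

```python
from typing import List

from sentence_transformers import SentenceTransformer, util


def semantic_answer_match(
    predictions: List[str],
    gold_labels: List[str],
    model_name: str = "T-Systems-onsite/cross-en-de-roberta-sentence-transformer",
    threshold: float = 0.7,
) -> List[bool]:
    """Flag each prediction whose cosine similarity to its gold label exceeds the threshold."""
    model = SentenceTransformer(model_name)
    sims = util.cos_sim(model.encode(predictions), model.encode(gold_labels)).diagonal()
    return [float(s) >= threshold for s in sims]
```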

tholor commented 3 years ago

That's awesome! This seems like a promising direction for a new, more meaningful QA metric, and I can see a couple of use cases for simplifying the labelling as well. I think it makes sense to move forward with the implementation and then eventually compare different models / fine-tune one specifically for answer similarity.

mrusic commented 3 years ago

From what I see here, this is not just relevant for evaluating QA answers but addresses a search use case in its own right. It solves the problem with keyword search where users want to find mentions in text (e.g. "nineteenth-century cartographic techniques") but trying all the different formulations is not practical (e.g. "nineteenth-century maps", "19th century maps", etc.).
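A minimal sketch of that use case, assuming the same pretrained model and the built-in `semantic_search` utility from sentence_transformers; the passages and query below are made up.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# hypothetical corpus passages
passages = [
    "The atlas reproduces several nineteenth-century maps of the region.",
    "Surveyors in the 1800s relied on triangulation to draw their maps.",
    "The museum's new wing focuses on contemporary photography.",
]

query = "nineteenth-century cartographic techniques"

# rank passages by cosine similarity to the query embedding
hits = util.semantic_search(
    model.encode(query, convert_to_tensor=True),
    model.encode(passages, convert_to_tensor=True),
    top_k=2,
)[0]
for hit in hits:
    print(f"{hit['score']:.2f}  {passages[hit['corpus_id']]}")
```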

Timoeller commented 3 years ago

The research looks promising. Continuing in deepset-ai/haystack#1516 with a draft implementation.