deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Similarity-based Evaluation of QA Answers #1518

Closed · Timoeller closed this issue 3 years ago

Timoeller commented 3 years ago

Research options to evaluate answers based on semantic similarity.

Timoeller commented 3 years ago

We ran some experiments with pretty promising results:

Model

There is already a pretrained SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer'). I suspect training a joint model with our TextPairClassification would give only minor improvements. If this turns out to be used a lot, we could still train one, since in this scenario we do not gain much from the indexing capabilities anyway.

Eval on German

We used the GermanQuAD test set annotations and only took answer pairs whose answers come from completely different text positions. We computed embeddings with the SentenceTransformer and compared them via cosine similarity. We checked all 40 answer pairs with cosine similarity > 0.4; all texts seemed to semantically match.
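For reference, a minimal sketch of that procedure: embed both answers of a pair with the pretrained model and keep pairs above a cosine-similarity threshold. The answer pairs below are made up; only the model name and the 0.4 threshold come from the comment above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# hypothetical gold/alternative answer pairs taken from different text positions
pairs = [
    ("im Jahr 1871", "1871"),
    ("der Deutsche Bundestag", "das Parlament"),
]

gold_emb = model.encode([a for a, _ in pairs])
alt_emb = model.encode([b for _, b in pairs])

# cosine similarity of each answer pair; keep the ones above the 0.4 threshold
sims = util.cos_sim(gold_emb, alt_emb).diagonal().tolist()
for (a, b), sim in zip(pairs, sims):
    if sim > 0.4:
        print(f"match ({sim:.2f}): '{a}' <-> '{b}'")
```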

Eval on English

We applied the same model to the SQuAD dev set and checked 180 answer pairs above cosine similarity 0.7. All texts seemed to match.

Modelling errors

The model seems to be a bit sloppy when it comes to numbers and dates.

Detecting labelling errors

When looking at the other end of the spectrum, at dissimilar answers, we might have a tool here for improving annotations much more easily: pairs with similarities below 0.2 looked like candidates for re-annotation.
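A hypothetical sketch of how such low-similarity pairs could be surfaced for review; the example annotations are made up, and only the 0.2 threshold comes from the observation above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# hypothetical pairs of gold annotations for the same question
label_a = ["in the nineteenth century", "Albert Einstein"]
label_b = ["roughly 40 kilometres", "the city of Ulm"]

sims = util.cos_sim(model.encode(label_a), model.encode(label_b)).diagonal().tolist()

# very dissimilar gold answers are candidates for an annotation review
for a, b, sim in zip(label_a, label_b, sims):
    if sim < 0.2:
        print(f"possible labelling error ({sim:.2f}): '{a}' vs. '{b}'")
```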

Implementation

Looking at the code, I believe a registered metric might be the best fit, since we want to parameterize the evaluation with a model and a threshold for deciding whether a prediction is close enough to the given gold labels.
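As a rough sketch of what such a parameterized metric could look like (the function name and signature below are hypothetical, not Haystack's actual metric interface):

```python
from typing import List

from sentence_transformers import SentenceTransformer, util


def semantic_answer_match(
    predictions: List[str],
    gold_labels: List[str],
    model_name: str = "T-Systems-onsite/cross-en-de-roberta-sentence-transformer",
    threshold: float = 0.7,
) -> List[bool]:
    """Flag each prediction whose cosine similarity to its gold label exceeds the threshold."""
    model = SentenceTransformer(model_name)
    sims = util.cos_sim(model.encode(predictions), model.encode(gold_labels)).diagonal()
    return [float(s) >= threshold for s in sims]
```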

tholor commented 3 years ago

That's awesome! This seems like a promising direction for a new, more meaningful QA metric, and I can see a couple of use cases for simplifying the labelling as well. I think it makes sense to move forward with the implementation and then eventually compare different models / fine-tune one specifically for answer similarity.

mrusic commented 3 years ago

From what I see here, this is not just relevant for evaluating QA answers but addresses a search use case in its own right. It solves the problem with keyword search where users want to find mentions in text (e.g. "nineteenth-century cartographic techniques") but trying all the different formulations is not practical (e.g. "nineteenth-century maps", "19th century maps", etc.).
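A minimal sketch of that use case, assuming the same pretrained model and the built-in `semantic_search` utility from sentence_transformers; the passages and query below are made up.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer")

# hypothetical corpus passages
passages = [
    "The atlas reproduces several nineteenth-century maps of the region.",
    "Surveyors in the 1800s relied on triangulation to draw their maps.",
    "The museum's new wing focuses on contemporary photography.",
]

query = "nineteenth-century cartographic techniques"

# rank passages by cosine similarity to the query embedding
hits = util.semantic_search(
    model.encode(query, convert_to_tensor=True),
    model.encode(passages, convert_to_tensor=True),
    top_k=2,
)[0]
for hit in hits:
    print(f"{hit['score']:.2f}  {passages[hit['corpus_id']]}")
```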

Timoeller commented 3 years ago

The research looks promising. Continuing in deepset-ai/haystack#1516 with a draft implementation.