Timoeller closed this issue 3 years ago.
We ran some experiments with pretty promising results:
There is already a pretrained SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')
I suspect training a joint model with our TextPairClassification would give only minor improvements. If this feature turns out to be used a lot, we could still train one, since in this scenario we do not gain much from the indexing capabilities anyway.
From the GermanQuAD test set annotations, we only took answer pairs coming from completely different text positions, computed embeddings with sentence-transformers, and compared them via cosine similarity. We checked all 40 answer pairs with cosine similarity > 0.4; all texts seemed to match semantically:
We applied the same model to the SQuAD dev set and checked 180 answer pairs with cosine similarity above 0.7. All texts seemed to match:
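The comparison above boils down to encoding each answer pair and thresholding the cosine similarity. A minimal sketch of that computation (the answer pairs and vectors here are toy stand-ins, not the real GermanQuAD/SQuAD data; in the experiment the vectors come from `SentenceTransformer("T-Systems-onsite/cross-en-de-roberta-sentence-transformer").encode(...)`):

```python
from math import sqrt

def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for sentence-transformer embeddings of two answers:
u = [0.2, 0.8, 0.1]
v = [0.25, 0.7, 0.05]
print(round(cosine_sim(u, v), 2))  # high similarity, above the 0.7 threshold
```

In the experiment, pairs above the threshold were then inspected manually to see whether the texts actually matched semantically.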
The model seems to be a bit sloppy when it comes to numbers and dates.
Looking at the other end of the spectrum, i.e. dissimilar answers, we might have a tool here for improving annotations much more easily. Examples with similarities below 0.2 are:
Looking at the code, I believe a registered metric might be the best fit, since we want to parameterize the evaluation with a model and a threshold for deciding whether a prediction is close to the given gold labels.
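A rough sketch of what such a registered metric could look like. The registry, decorator, and function names here are assumptions for illustration, not the actual Haystack API; `sim_fn` stands in for an embedding-based similarity like the cosine comparison above:

```python
# Hypothetical metric registry, parameterized by a similarity function and threshold.
METRICS = {}

def register_metric(name):
    """Register a metric function under a name (illustrative sketch)."""
    def wrap(fn):
        METRICS[name] = fn
        return fn
    return wrap

@register_metric("semantic_answer_similarity")
def semantic_answer_similarity(predictions, gold_labels, sim_fn, threshold=0.7):
    """Count a prediction as correct if it is semantically close to any gold label."""
    correct = 0
    for pred, golds in zip(predictions, gold_labels):
        if any(sim_fn(pred, gold) >= threshold for gold in golds):
            correct += 1
    return correct / len(predictions)

# Usage with a trivial exact-match stand-in for the embedding similarity:
sim = lambda a, b: 1.0 if a == b else 0.0
score = METRICS["semantic_answer_similarity"](["1871", "Paris"], [["1871"], ["London"]], sim)
print(score)  # → 0.5
```

Swapping `sim_fn` for a sentence-transformer cosine similarity would give the semantic version discussed here.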
That's awesome! This seems like a promising direction for a new, more meaningful QA metric, and I can see a couple of use cases for simplifying labelling as well. I think it makes sense to move forward with the implementation and then eventually compare different models / fine-tune one specifically for answer similarity.
From what I see here, this is not just relevant for evaluating QA answers but addresses a search use case in its own right. It solves a problem of keyword search, where users want to find mentions in text (e.g. nineteenth-century cartographic techniques) but trying all the different formulations (e.g. nineteenth-century maps, 19th century maps, etc.) is not practical.
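That search use case amounts to ranking passages by embedding similarity to the query instead of by keyword overlap. A toy sketch of the ranking step (the token-count "embedding" here is a deliberately simple stand-in for the real sentence-transformer embeddings, and the passages are invented):

```python
from math import sqrt

def tokenize(text):
    return text.lower().replace("-", " ").replace(".", "").split()

def embed(text, vocab):
    # Toy embedding: token counts over a shared vocabulary.
    # A real system would use SentenceTransformer.encode() instead.
    toks = tokenize(text)
    return [toks.count(w) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

passages = [
    "Cartographic techniques of the 19th century improved map accuracy.",
    "A recipe for sourdough bread.",
]
query = "nineteenth-century maps"

vocab = sorted({w for text in passages + [query] for w in tokenize(text)})
q = embed(query, vocab)
ranked = sorted(passages, key=lambda p: cosine(embed(p, vocab), q), reverse=True)
print(ranked[0])
```

With real sentence embeddings, even formulations with no word overlap (e.g. "nineteenth-century" vs. "19th century") would land near each other, which is exactly what makes this more robust than keyword search.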
Research shows promise - continuing in deepset-ai/haystack#1516 with draft implementation
Research options to evaluate answers based on semantic similarity.