cdqa-suite / cdQA

⛔ [NOT MAINTAINED] An End-To-End Closed Domain Question Answering System.
https://cdqa-suite.github.io/cdQA-website/
Apache License 2.0

Return confidence probability of the answer. #195

Closed alex-movila closed 5 years ago

alex-movila commented 5 years ago

Hi, I need to know the confidence level of a prediction in order to return a default answer if the confidence is too low.

andrelmfarias commented 5 years ago

Hi @alex-movila,

Indeed, this feature would be very useful. However, defining a proper confidence level for a predicted answer is not straightforward...

The cdQA pipeline has two different scores: Retriever Score and Reader Score.

The Retriever score is based on the cosine similarity between the tf-idf features of the documents and those of the question (that is how it ranks and selects the documents to send to the Reader).
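A minimal sketch of that retrieval step, assuming scikit-learn and made-up document strings (not cdQA's actual internals):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus; cdQA builds its features from a document dataframe.
documents = [
    "BNP Paribas reported strong quarterly results.",
    "The bank opened a new office in Singapore.",
]
question = "How were the quarterly results?"

vectorizer = TfidfVectorizer()
doc_features = vectorizer.fit_transform(documents)    # one tf-idf vector per document
question_features = vectorizer.transform([question])  # tf-idf vector for the question

# Retriever score: cosine similarity between the question and each document.
scores = cosine_similarity(question_features, doc_features)[0]
ranked = scores.argsort()[::-1]  # document indices ordered by retriever score
```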

Then, the Reader predicts an answer for each paragraph and computes a score for that answer within its paragraph, by applying a softmax over all possible answer spans in that paragraph. This score cannot be used to compare answers coming from different paragraphs, as the paragraphs are not related to each other and the softmax is normalized "inside" each paragraph. To select the final answer we use the same approach as DrQA: we do not normalize the exponentials of the softmax and instead compare the unnormalized exponentials of the answers across paragraphs (in fact, we do not even compute the exp(); since exp() is monotonic, we can compare the logits directly).
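A toy sketch of that selection rule with invented numbers (not the cdQA code itself):

```python
# Hypothetical (start_logit + end_logit) of the best span in each paragraph.
best_span_logits = {
    "paragraph_0": 4.1,
    "paragraph_1": 7.8,
    "paragraph_2": 6.9,
}

# Per-paragraph softmax scores are NOT comparable across paragraphs:
# each distribution is normalized only over the spans of its own paragraph.
# Comparing raw logits directly (the DrQA trick) is valid because exp() is
# monotonic, so argmax over logits equals argmax over exp(logits).
final = max(best_span_logits, key=best_span_logits.get)
print(final)  # paragraph_1
```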

As you can see, the score returned by the Reader is not a proper confidence probability for the answer, as it is not comparable across all possible answers.

One could think that the solution is to compute a softmax over the predicted answers of all paragraphs. But this can lead to misleading confidence probabilities. Consider a question whose correct answer is present (identically) in two different paragraphs: when we apply the softmax over these candidates, each copy gets a score close to 0.5, while if the answer were found in only one paragraph its score would be close to 1...
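A numeric illustration of that failure mode, with invented logits:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Case 1: the correct answer appears in a single paragraph.
print(softmax(np.array([8.0, 1.0, 0.5])))  # ~[0.999, 0.001, 0.001] -> confidence ~1

# Case 2: the SAME correct answer appears in two paragraphs with the same
# logit: the probability mass is split, ~0.5 each, even though duplication
# is actually extra evidence for the answer.
print(softmax(np.array([8.0, 8.0, 0.5])))  # ~[0.4997, 0.4997, 0.0006]
```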

The Retriever score is not a proper confidence score either, as it measures question-document similarity, not the probability that the answer is correct.

With @fmikaelian, we are thinking about a solution for that.

An idea would be to train the BERT Reader on SQuAD 2.0, a dataset in which some questions are labeled as having no answer. A model trained on such a dataset should be able to identify when the paragraph contains no answer to the question.
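For reference, the standard SQuAD 2.0 scoring rule compares the best span score against the model's "no answer" score at the [CLS] position; a rough sketch with hypothetical logits, not a trained cdQA model:

```python
import numpy as np

# Hypothetical start/end logits over tokens; index 0 is the [CLS] token,
# which a SQuAD 2.0-trained model uses to signal "no answer".
start_logits = np.array([3.2, 0.1, 5.0, 0.3])
end_logits = np.array([2.9, 0.2, 0.4, 4.8])

null_score = start_logits[0] + end_logits[0]  # score of predicting "no answer"

# Best non-null span (start <= end, excluding position 0).
best_span_score = max(
    start_logits[i] + end_logits[j]
    for i in range(1, len(start_logits))
    for j in range(i, len(end_logits))
)

threshold = 0.0  # in practice tuned on a dev set
if null_score - best_span_score > threshold:
    print("no answer in this paragraph")
else:
    print("return the best span")
```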

alex-movila commented 5 years ago

For now I think even an imperfect solution would be OK. Just return the scores or logits and I can test them against a threshold. If there are multiple candidates with close, low scores, I can consider that there is a lack of confidence.
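A hedged sketch of that thresholding idea, assuming a hypothetical extension of the pipeline that returns candidate answers together with their raw logit scores (not the current cdQA API):

```python
THRESHOLD = 5.0  # minimum acceptable top score, tuned empirically
MARGIN = 0.5     # minimum gap between the top two candidates

def answer_or_default(candidates, default="Sorry, I am not sure about that."):
    """candidates: list of (answer_text, logit_score) pairs, best first."""
    if not candidates:
        return default
    best_answer, best_score = candidates[0]
    # Low absolute score -> the Reader is not confident anywhere.
    if best_score < THRESHOLD:
        return default
    # Several candidates with nearly the same score -> ambiguous.
    if len(candidates) > 1 and best_score - candidates[1][1] < MARGIN:
        return default
    return best_answer
```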