Closed: tudor-berariu closed this issue 8 years ago.
Lines 97-104 do the following --
A given generated answer is evaluated against all 10 choose 9 = 10 subsets of 9 ground-truth answers (the for loop in line 97). In each iteration, the evaluation uses the metric
min(1, (number of matching answers among the 9 ground-truth answers) / 3)
[line 100].
So if a question has 8 "yes" answers and 2 "no" answers (provided by the workers), the accuracy of a generated answer would be 0.6 for "no" and 1.0 for "yes". Below is the detailed calculation --
accuracy for "no" -- 1/10 * ( 8 * min(1, 2/3) + 2 * min(1, 1/3) ) = 0.6
accuracy for "yes" -- 1/10 * ( 8 * min(1, 7/3) + 2 * min(1, 8/3) ) = 1
In each case, the first term covers the 8 subsets obtained by dropping one of the "yes" answers, and the second term covers the 2 subsets obtained by dropping one of the "no" answers.
BTW, the accuracies that you mentioned -- 0.533 for "no" and 0.2 for "yes" -- are these the results from running the code (lines 97-104 in vqaEval.py)?
As mentioned in the paper, we average over all subsets of 9 answers (out of the 10 collected) so that we have consistency between the human evaluation scores (which, without collecting an eleventh answer, can only compare each human answer against the other 9) and the results people will report on automatically generated answers.
Thank you very much for your answer.
I misunderstood line 98, specifically this piece of code: `if item!=gtAnsDatum`. I thought it was meant to remove all answers identical to the current one.
Thank you for your time.
Yes, it's a subtle one, due to `gtAnsDatum` being an answer object and not the actual answer string.
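To make the subtlety concrete, here is a minimal sketch; the dict fields are meant to be illustrative of the VQA annotation format (each human answer is a small object, not a bare string), and only `gtAnsDatum` and `item` are names taken from the quoted code:

```python
# Each ground-truth answer is a distinct object (here, a dict with an 'answer'
# string and a unique 'answer_id'), so two workers who both typed "no" are
# still different entries. `item != gtAnsDatum` therefore filters out only the
# single held-out entry, not every answer with the same string.

gt_answers = ([{"answer": "yes", "answer_id": i} for i in range(1, 9)]
              + [{"answer": "no", "answer_id": i} for i in range(9, 11)])

gtAnsDatum = gt_answers[8]  # hold out the first "no" entry
other_answers = [item for item in gt_answers if item != gtAnsDatum]

print(len(other_answers))                               # 9: only one entry was removed
print(sum(a["answer"] == "no" for a in other_answers))  # 1: the other "no" is still there
```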
In vqaEval.py, lines 97-104, the code that computes the accuracy for a generated answer seems to produce strange values. For example, if a question has 8 "yes" answers and 2 "no" answers (provided by the workers), the accuracy of a generated answer would be 0.533 for "no" and 0.2 for "yes":
accuracy for "no" -- 8/10 * min(1, 2/3) = 0.533
accuracy for "yes" -- 2/10 * min(1, 8/3) = 0.2
Can you please explain the reasons for that specific evaluation scheme?
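For reference, a minimal sketch (not code from the repository) of the calculation behind these two numbers; it reads `if item!=gtAnsDatum` as removing every ground-truth answer with the same string as the held-out one, which, as clarified above, is not what the actual code does (the leave-one-out reading gives 0.6 and 1.0 instead):

```python
# Sketch of the misreading that produces 0.533 and 0.2: for each held-out
# answer, remove ALL ground-truth answers equal to it as strings, then score
# the candidate against what is left with min(1, matches / 3).

def misread_accuracy(candidate, gt_strings):
    per_iteration = []
    for held_out in gt_strings:
        others = [a for a in gt_strings if a != held_out]
        matches = sum(1 for a in others if a == candidate)
        per_iteration.append(min(1.0, matches / 3.0))
    return sum(per_iteration) / len(per_iteration)

gt = ["yes"] * 8 + ["no"] * 2
print(round(misread_accuracy("no", gt), 3))   # 0.533
print(round(misread_accuracy("yes", gt), 3))  # 0.2
```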