GT-Vision-Lab / VQA


Strange evaluation #1

Closed tudor-berariu closed 8 years ago

tudor-berariu commented 8 years ago

In vqaEval.py, lines 97-104, the code that computes the accuracy for a generated answer seems to produce strange values. For example, if a question has 8 "yes" answers and 2 "no" answers (provided by the workers), the accuracy of a generated answer would be 0.533 for "no" and 0.2 for "yes".

Can you please explain the reasons for that specific evaluation scheme?

AishwaryaAgrawal commented 8 years ago

Lines 97-104 are doing the following --

For a given generated answer, it is evaluated against each of the 10 choose 9 subsets of ground truth answers (the for loop on line 97). In each such iteration, the evaluation uses the metric min(1, (number of matching answers among the 9 ground truth answers) / 3) [line 100].

So if a question has 8 "yes" answers and 2 "no" answers (provided by the workers), the accuracy of a generated answer would be 0.6 for "no" and 1.0 for "yes". Below is the detailed calculation --

accuracy for "no" -- 1/10 * ( 8 * min(1, 2/3) + 2 * min(1, 1/3) ) = 0.6 accuracy for "yes" -- 1/10 * ( 8 * min(1, 7/3) + 2 * min(1, 8/3) ) = 1

BTW, the accuracies that you mentioned -- 0.533 for "no" and 0.2 for "yes" -- are these the results from running the code (lines 97-104 in vqaEval.py)?

StanislawAntol commented 8 years ago

As mentioned in the paper, we average over all 10 choose 9 subsets of the human answers so that the human accuracy scores (computed without collecting an eleventh answer) are consistent with the results people will report on automatically generated answers.

tudor-berariu commented 8 years ago

Thank you very much for your answer. I misunderstood line 98, specifically this piece of code: if item!=gtAnsDatum. I thought it was meant to remove all answers identical to the current one.

Thank you for your time

StanislawAntol commented 8 years ago

Yes, it's a subtle one due to gtAnsDatum being an object and not the actual string.
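To spell out the subtlety, here is a hypothetical sketch, assuming each ground truth answer is a dict with a unique answer_id (as in the VQA annotation format):

```python
# Each worker's answer is a dict with a unique answer_id, so two "yes"
# answers are still distinct dicts. Comparing whole dicts therefore
# drops only the single held-out answer, not every answer with the
# same string.
gt_answers = [
    {"answer": "yes", "answer_id": 1},
    {"answer": "yes", "answer_id": 2},
    {"answer": "no",  "answer_id": 3},
]
held_out = gt_answers[0]

remaining = [item for item in gt_answers if item != held_out]
print([a["answer_id"] for a in remaining])  # [2, 3] -- only answer_id 1 is removed

# Comparing on the answer string instead would remove every "yes",
# which is the misreading discussed above.
remaining_by_string = [a for a in gt_answers if a["answer"] != held_out["answer"]]
print([a["answer_id"] for a in remaining_by_string])  # [3]
```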