lzhptr opened 3 days ago
Hello, I have a small question about the VQA evaluation. Why is the reported VQA score higher than EM here? EM counts a prediction as 1 if it exactly matches any single answer in the answer set, whereas the VQA score requires at least three matching annotations to reach 1, so EM should be the higher of the two. Thank you very much!
Hi, below is my response to a similar question raised by another researcher:
This is a good question. I double-checked the log of one of the trained models and it reports 62.16 VQA and 61.69 EM. I also checked other models: more often than not, EM is higher than VQA, while in some rare cases EM is slightly lower. You are right that, in theory, EM should always be higher than VQA.

I suspect the discrepancy comes from an inconsistency between the two metric implementations. For the VQA score, to align with other works, we used the official implementation (the one provided by OK-VQA); for exact match, since there is no official implementation, we did a rough pass that matches strings directly. The official VQA score includes post-processing steps that allow vague matches, whereas our EM computation does not.

We released all our code in the GitHub repository (though RAVQA v2 is still under clean-up, v1 is already there), and the metric code has not changed since v1. If you are interested in digging into this issue, you can compare the two metrics and their implementations. Hope this helps!
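For anyone landing here, below is a minimal sketch of how this discrepancy can arise. This is not the repository's actual metric code: the function names (`normalize`, `vqa_score`, `exact_match`) and the normalisation steps are hypothetical stand-ins, using the commonly cited VQA formula min(#matching annotators / 3, 1) and a rough approximation of the official post-processing (lowercasing, stripping punctuation and articles).

```python
import re

# Hypothetical sketch, not the code used in this repo. The official VQA
# accuracy is commonly summarised as min(#annotators giving the answer / 3, 1),
# applied after answer normalisation; EM here compares raw strings directly.

_ARTICLES = {"a", "an", "the"}

def normalize(ans):
    """Rough stand-in for the official VQA post-processing."""
    ans = re.sub(r"[^\w\s]", "", ans.lower().strip())  # drop punctuation
    return " ".join(w for w in ans.split() if w not in _ARTICLES)

def vqa_score(pred, annotations):
    """min(#matching annotators / 3, 1), comparing normalised strings."""
    matches = sum(normalize(pred) == normalize(a) for a in annotations)
    return min(matches / 3.0, 1.0)

def exact_match(pred, annotations):
    """1 if the raw prediction string appears verbatim in the answer set."""
    return float(pred in annotations)

anns = ["New York.", "new york", "New York", "NYC", "new york",
        "new york", "Manhattan", "new york", "New york", "ny"]

# Same prediction, different outcomes: EM fails on the raw string,
# while the normalised VQA comparison matches well over 3 annotators.
print(exact_match("new york.", anns))  # 0.0
print(vqa_score("new york.", anns))    # 1.0
```

Under identical normalisation, per-example EM would upper-bound the VQA score; it is only because the raw-string comparison is stricter than the VQA post-processing that the aggregated EM can dip below VQA, which matches what we observed.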