piyushkhanna00705 opened this issue 11 months ago
Hi, we have updated the code to include the evaluation code. Additionally, a drop in performance can be expected if BLIP-2 XL is used instead of XXL, and also if GPT-3.5 is used instead of Codex (which we used in our experiments). We did not run the experiments with GPT-3.5, so we do not have numbers on how much not using Codex affects the results, but qualitatively GPT-3.5 is not as good (it may just be a matter of prompt engineering, as GPT-3.5 is not code-specific).
But I would suggest using our evaluation code to reduce the differences with respect to our experiments, so that we can narrow down where the discrepancy comes from.
Thanks for the great work! I love how interpretable ViperGPT is! I am trying to evaluate the results on the OK-VQA dataset, but I am facing an issue similar to Issue #24, wherein the model generates a full-sentence answer instead of the specific one-word answer required for it to count as correct under exact-match accuracy. I also tried being a bit "lenient" when calculating the accuracy, marking a prediction as correct if the answer word exists in the model's full-sentence prediction; however, I still got an accuracy lower than the one reported in the paper.
Here are the evaluation metrics from my experiments:
- Exact-Match Accuracy (a prediction is wrong unless it exactly matches the answer): 9.435%
- "Lenient" Accuracy (a prediction is correct if the answer word exists in the model's full-length prediction): 21.62%
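For reference, this is roughly how I am computing the two metrics. It is only a minimal sketch; the normalization and function names are illustrative and are not taken from the ViperGPT evaluation code:

```python
import re

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace
    # (my own normalization, not necessarily the official one).
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text)

def exact_match(prediction: str, answer: str) -> bool:
    # Wrong unless the prediction matches the ground-truth answer exactly.
    return normalize(prediction) == normalize(answer)

def lenient_match(prediction: str, answer: str) -> bool:
    # Correct if the answer word appears anywhere in the full-sentence prediction.
    return normalize(answer) in normalize(prediction)

# A full-sentence prediction fails exact match but passes the lenient check.
pred = "The animal in the picture is a giraffe."
ans = "giraffe"
print(exact_match(pred, ans))    # False
print(lenient_match(pred, ans))  # True
```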
I am using GPT-3.5 for code generation and blip2-flan-t5-xl for visual queries. Could using blip2-flan-t5-xl instead of blip2-flan-t5-xxl have resulted in such a large drop in accuracy? I would have expected the "lenient" accuracy to be at least as high as the one reported in the paper, since it may even miscount a few answers as correct when they are not.