cvlab-columbia / viper

Code for the paper "ViperGPT: Visual Inference via Python Execution for Reasoning"

OkVQA Evaluation #40

Open piyushkhanna00705 opened 7 months ago

piyushkhanna00705 commented 7 months ago

Thanks for the great work! I love how interpretable ViperGPT is. I am trying to evaluate the results on the OK-VQA dataset, but I am running into the same issue as #24: the model generates a full-sentence answer instead of the short (one-word) answer needed for exact-match accuracy. I also tried being a bit "lenient" when computing accuracy, marking a prediction as correct if the answer word appears anywhere in the model's full-sentence prediction, but I still got an accuracy lower than the one reported in the paper.

Here are the evaluation metrics from my experiments:

- Exact-match accuracy (prediction counted wrong unless it exactly matches the answer): 9.435%
- "Lenient" accuracy (prediction counted correct if the answer word appears anywhere in the model's full-sentence prediction): 21.62%
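For concreteness, here is a minimal sketch of how I computed both numbers; `predictions` and `answers` are hypothetical names for my parallel lists of model outputs and single ground-truth answers, and the lowercasing/stripping is my own simplification:

```python
def exact_match_accuracy(predictions, answers):
    # Correct only if the normalized prediction equals the answer exactly.
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return 100.0 * correct / len(predictions)

def lenient_accuracy(predictions, answers):
    # Correct if the answer word appears anywhere in the full-sentence prediction.
    correct = sum(a.strip().lower() in p.strip().lower()
                  for p, a in zip(predictions, answers))
    return 100.0 * correct / len(predictions)
```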

I am using GPT-3.5 for code generation and blip2-flan-t5-xl for visual queries. Could using blip2-flan-t5-xl instead of blip2-flan-t5-xxl have caused such a large drop in accuracy? I would have expected the "lenient" accuracy to be at least as high as the number reported in the paper, since it can even miscount a few answers as correct when they are not.
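In case it is relevant, this is roughly how I load the smaller BLIP-2 checkpoint via Hugging Face transformers (this is my own loading snippet, not the repo's code; swapping in `Salesforce/blip2-flan-t5-xxl` would restore the model used in the paper):

```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16
).to("cuda")
```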

surisdi commented 6 months ago

Hi, we have updated the repository with the evaluation code. Additionally, a drop in performance is expected if BLIP-2 xl is used instead of xxl, and also if GPT-3.5 is used instead of Codex (which we used in our experiments). We did not run experiments with GPT-3.5, so we do not have numbers for how much not using Codex affects the results, but qualitatively GPT-3.5 is not as good (it may just be a matter of prompt engineering, since GPT-3.5 is not code-specific).

In any case, I would suggest using our evaluation code, so that we can narrow down the differences with respect to our experiments.
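For reference, the official OK-VQA evaluation follows the standard VQA-style soft accuracy, where a prediction gets credit based on how many human annotators gave that answer. Below is a minimal sketch of that formula only (the official scorer also applies answer normalization, which is omitted here):

```python
def vqa_soft_accuracy(prediction, gt_answers):
    # Standard VQA-style soft accuracy: a prediction is fully correct
    # when at least 3 human-annotated answers match it, and gets
    # partial credit (matches / 3) otherwise.
    pred = prediction.strip().lower()
    matches = sum(1 for a in gt_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)
```

This is also why a short canonical answer matters: a full-sentence prediction will rarely string-match the annotated answers, so both exact-match and soft accuracy will heavily penalize it.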