piyushkhanna00705 opened this issue 11 months ago
Hi, we have updated the code to include the evaluation code. Additionally, a drop in performance can be expected if BLIP-2 XL is used instead of XXL, and also if GPT-3.5 is used instead of Codex (which we used in our experiments). We did not run the experiments with GPT-3.5, so we do not have numbers on how much not using Codex affects the results, but qualitatively GPT-3.5 is not as good (it may just be a matter of prompt engineering, as GPT-3.5 is not code-specific).
But I would suggest using our evaluation code to reduce the differences with respect to our experiments, so that we can narrow down where the discrepancy comes from.
Thanks for the great work! I love how interpretable ViperGPT is! I am trying to evaluate the results on the OK-VQA dataset, but I am facing an issue similar to Issue #24, wherein the model generates a full-sentence answer instead of the specific one-word answer required for it to count as correct under exact-match accuracy. I also tried being a bit "lenient" when calculating the accuracy, marking a prediction as correct if the answer word exists in the model's full-sentence prediction; however, I still got an accuracy lower than the one reported in the paper.
Here are the evaluation metrics from my experiments:
- Exact-Match Accuracy (a prediction is wrong unless it exactly matches the answer): 9.435%
- "Lenient" Accuracy (a prediction is correct if the answer word exists in the model's full-length prediction): 21.62%
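For reference, this is roughly how I am computing the two metrics. It is only a minimal sketch; the normalization and function names are illustrative and are not taken from the ViperGPT evaluation code:

```python
import re

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace
    # (my own normalization, not necessarily the official one).
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text)

def exact_match(prediction: str, answer: str) -> bool:
    # Wrong unless the prediction matches the ground-truth answer exactly.
    return normalize(prediction) == normalize(answer)

def lenient_match(prediction: str, answer: str) -> bool:
    # Correct if the answer word appears anywhere in the full-sentence prediction.
    return normalize(answer) in normalize(prediction)

# A full-sentence prediction fails exact match but passes the lenient check.
pred = "The animal in the picture is a giraffe."
ans = "giraffe"
print(exact_match(pred, ans))    # False
print(lenient_match(pred, ans))  # True
```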
I am using GPT-3.5 for code generation and blip2-flan-t5-xl for visual queries. Could using blip2-flan-t5-xl instead of blip2-flan-t5-xxl have resulted in such a large drop in accuracy? I would have expected the "lenient" accuracy to be at least as high as the one reported in the paper, since it may even miscount a few answers as correct when they are not.