cvlab-columbia / viper

Code for the paper "ViperGPT: Visual Inference via Python Execution for Reasoning"

GQA Evaluation #24

Closed devaansh100 closed 6 months ago

devaansh100 commented 1 year ago

Thanks for the great work! I wanted to reproduce the evaluation on GQA, but I am not sure how to do that.

GQA treats question answering as a classification problem, and I am not sure how to handle that in this setting. Previous models (specifically, LXMERT) train a classifier for this, but how do we replicate that here?

I tried sentence-similarity models, but since the answers are only single words, this does not work very well.

If possible, could you provide the code or let me know the implementation used for this?

Thanks in advance!

surisdi commented 1 year ago

Hi, you can pass a possible_options variable as a list to the execute_command function (make sure to also add it to the prompt).
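As a rough illustration (not the code actually used), the generated program could then look something like the sketch below, where possible_options is the list you pass in and best_text_match is the text-matching primitive from the ImagePatch API:

# Hypothetical sketch: the generated program receives the candidate answers
# and returns the one that best matches the image content.
def execute_command(image, possible_options):
    image_patch = ImagePatch(image)
    # Rank the candidate answers against the image and return the best match.
    return image_patch.best_text_match(possible_options)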

devaansh100 commented 1 year ago

Do you mean the generated code will have the header def execute_command(image, possible_options)? And just to confirm, possible_options will have a length of around 1800 for GQA - would all of these be passed in the prompt as well?

Also, which function in ImagePatch will deal with this parameter - as in, where do I need to add the in-context examples?

HeimingX commented 1 year ago

Hi, I also ran into this problem on GQA. Could the authors share the relevant code? Many thanks!

devaansh100 commented 1 year ago

I wanted to follow up on this thread. Could you please help us with the above, @surisdi?

Thanks in advance!

surisdi commented 1 year ago

Hi,

Could you provide more details about your setting? We treated the GQA questions as open-ended questions. Therefore, it is not a classification problem. There is no need to provide a pre-determined list of options.

If you want to add in-context examples, you can do it after the ImagePatch class definition. For example, you can write a couple of def execute_command(image): functions before the one with the actual query.
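A rough sketch of that layout (the example query and code are made up for illustration, not taken from the released prompt):

# ... ImagePatch class definition (API prompt) above ...

# Query: What color is the car?
def execute_command(image):
    image_patch = ImagePatch(image)
    car_patches = image_patch.find("car")
    if len(car_patches) == 0:
        # Fall back to querying the whole image if no car is found.
        return image_patch.simple_query("What color is the car?")
    return car_patches[0].simple_query("What color is the car?")

# Query: <your actual question>
def execute_command(image):
    ...  # left for the model to complete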

devaansh100 commented 1 year ago

I am trying to reproduce the GQA results.

When the questions are treated as open-ended, how is the accuracy in Table 2 of the paper computed? Could you help me reproduce those results?

To the best of my knowledge, you only obtain an accuracy metric by defining a global answer set and treating the task as a classification problem, which is why I'm a bit confused.

devaansh100 commented 1 year ago

Hello, I'm sorry to bother you again with this, but it would be great if you could let me know how you go from the generated answer to the label answer. I did not find other training-free approaches for GQA, so I am unsure how to do this without a classifier.

Asking ChatGPT to choose an answer does not work well either, since the number of possible answers is too large (~1800). I even tried pruning the list, but the training-free retrieval does not work very well.

surisdi commented 1 year ago

Hi,

The accuracy metric does not require a fixed pre-defined set of possible answers. It is defined as the number of correct answers divided by the total number of questions.

For example, if the question is "how many apples are there?" and the answer is "3", you don't need a list of all possible numbers between 0 and 10 to choose from; you can simply answer "3". Anything else will be wrong. If you get that one right and then get the next one wrong, your accuracy will be 50%. Does that make sense?

Best,

devaansh100 commented 1 year ago

I understand. However, to find the correct answer you would need to do exact matching against this list. To do that, you need to map the generated answer to an answer in that list, which is typically done with a classifier. To give an example:

Q: What is around the open window?
A from ViperGPT: "The draperies are around the window."
Actual A: "drapes" (needed to get +1 for accuracy)

How do we go from "A from ViperGPT" to "Actual A"? Previous methods like LXMERT train a classifier for this; how did ViperGPT do it without one?

surisdi commented 1 year ago

In theory, you should not use the answers in train/val, because the test set may have new answers.

In order to obtain answers in the correct format, you can provide a couple of in-context examples to ViperGPT so that it produces programs that output the correct type of answer.

For example, at the end of the prompt, and before your query, you can add a ground-truth question-code pair that follows the type of answer you expect.
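Purely as an illustration (this exact pair is not from our prompt), it could look like the following, where simple_query returns a short, GQA-style answer such as "drapes" rather than a full sentence:

# Query: What is around the open window?
def execute_command(image):
    image_patch = ImagePatch(image)
    # simple_query asks the underlying VQA model and returns a short answer,
    # which is the format GQA expects.
    return image_patch.simple_query("What is around the open window?")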

devaansh100 commented 1 year ago

Traditionally, GQA has been treated as a classification problem, which is where the accuracy metric comes from. If I'm not wrong, the test set does not have answers outside the list of answers in the train/val set.

There are ways to do this (prompting, extra retrieval steps, etc.), but I wanted to know which method was followed in the paper, so that I can reproduce the results.

apoorvkh commented 1 year ago

I also wanted to ask: your paper reports GQA accuracy on the test-dev split, but I believe there are 172,174 questions in test-dev_all and 12,578 in test-dev_balanced. Both of these seem like infeasible quantities.

Did you evaluate on a random subset (like VisProg)?

surisdi commented 1 year ago

@devaansh100 In the paper we directly compare the predicted output (a word) to the ground truth. If it is not the same, we count the answer as wrong.

@apoorvkh We use the test-dev balanced dataset, which is the most commonly used one. Unlike VisProg, we did evaluate on the full set. The bottleneck was the number of queries per minute that Codex allowed, but it is feasible.
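For reference, the scoring then amounts to something like the sketch below (the lower-casing and whitespace stripping are assumptions for illustration, not necessarily the exact script we ran):

def gqa_accuracy(predictions, ground_truths):
    # Exact-match accuracy: a prediction only counts if it equals the ground truth.
    correct = sum(
        pred.strip().lower() == gt.strip().lower()  # assumed normalization
        for pred, gt in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)

# e.g. gqa_accuracy(["drapes", "2"], ["drapes", "3"]) == 0.5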

apoorvkh commented 1 year ago

Thanks for clarifying!

devaansh100 commented 1 year ago

I see. I did try this but it was not working out for me.

Let me recheck my implementation and get back to you. Thank you!

Zeqing-Wang commented 1 year ago

> @devaansh100 In the paper we directly compare the predicted output (a word) to the ground truth. If it is not the same, we count the answer as wrong.
>
> @apoorvkh We use the test-dev balanced dataset, which is the most commonly used one. Unlike VisProg, we did evaluate on the full set. The bottleneck was the number of queries per minute that Codex allowed, but it is feasible.

Thanks for clarifying the GQA dataset. I'm trying to reproduce the results, but I only get about 0.3 accuracy on the test-dev balanced set, while the accuracy on test-dev_all is close to 0.5, i.e. close to the result in the paper (we ran all of the test-dev balanced cases, but only a subset of the test-dev_all cases). If possible, could you tell me what could be the reason for this?

Thanks in advance!

scuwyh2000 commented 6 months ago

> Thanks for clarifying the GQA dataset. I'm trying to reproduce the results, but I only get about 0.3 accuracy on the test-dev balanced set, while the accuracy on test-dev_all is close to 0.5, i.e. close to the result in the paper. If possible, could you tell me what could be the reason for this?

Hi, have you solved this problem?

surisdi commented 6 months ago

Hi, apologies for the delay. We added some code to run benchmarks, including the dataset code that we used for evaluation. I hope it answers your questions.

shirley-wu commented 5 months ago

Hi @surisdi, thank you so much for your latest commit ac2fa260d450951739a3bca819dd103e73e269a7! I tried the new code with a few bug fixes in the code and config files. I evaluated on the first 1000 samples of the test-dev balanced split and got 37.6 accuracy, which is still far behind your 48.1. Is this behavior expected, or should I keep looking into potential problems?

Currently I run it in two steps: (1) cache all the generated code, then (2) execute the cached code with another config file.

Config file for step 1 is:

dataset:
    data_path: 'xxx/data/gqa'
    dataset_name: GQA
    split: testdev
    testing: False
    max_samples: 1000
    batch_size: 20
    start_sample: 0

prompt : ./prompts/benchmarks/gqa.prompt
results_dir : ./results/gqa/

load_models:
    maskrcnn: True
    clip: False
    glip: True
    owlvit: False
    tcl: False
    gpt3_list: False
    gpt3_qa: False
    gpt3_guess: False
    depth: False
    blip: True
    saliency: False
    xvlm: True

fixed_code_file: ./prompts/fixed_code/blip2.prompt

Config file for step 2 is:

execute_code: True                                 # Execute the code after generating it. Only applies to main_batch

dataset:
    data_path: 'xxx/data/gqa'
    dataset_name: GQA
    split: testdev
    testing: False
    max_samples: 1000
    batch_size: 20
    start_sample: 0

prompt : ./prompts/benchmarks/gqa.prompt
results_dir : ./results/gqa/

load_models:
    maskrcnn: True
    clip: False
    glip: True
    owlvit: False
    tcl: False
    gpt3_qa: True
    gpt3_general: True
    gpt3_guess: True
    depth: True
    blip: True
    saliency: False
    xvlm: True
    codex: True
    codellama: False

gpt3:  # emmm, davinci is discontinued
    model: chatgpt

fixed_code_file: ./prompts/fixed_code/blip2.prompt
use_cached_codex: True
cached_codex_path: 'results/gqa/testdev/results_xxx.csv'

BTW: I previously tried to reproduce the 48.1 number with version bde4c6343825e6a131547cdfdeed8a62c9ac4b11 and only got 29 accuracy.

surisdi commented 5 months ago

Hi @shirley-wu, all our numbers were obtained with the code-davinci model. We have not tested other models, which probably contributes to the difference in results. There are two main differences between the code model and ChatGPT. First, ChatGPT is a chat-based model, as opposed to a completion-based model. Second, ChatGPT is not specifically trained to output code. We added a prompt file to deal with chat-based models (chatapi.prompt), although we have not tested it other than qualitatively on a couple of examples. Have you tried it?

shirley-wu commented 5 months ago

@surisdi thank you for your quick response!

Yes, I'm using chatapi.prompt. I'm using the same basic_config.yaml as in the master branch, and its codex config is as follows:

codex:
    temperature: 0.                                 # Temperature for Codex. (Almost) deterministic if 0
    best_of: 1                                      # Number of tries to choose from. Use when temperature > 0
    max_tokens: 512                                 # Maximum number of tokens to generate for Codex
    prompt: ./prompts/chatapi.prompt                # Codex prompt file, which defines the API. (doesn't support video for now due to token limits)
    model: gpt-3.5-turbo

surisdi commented 5 months ago

You can try using gpt-4 instead of gpt-3.5-turbo. It may be very expensive though.