CeeZh / LLoVi

Official implementation for "A Simple LLM Framework for Long-Range Video Question-Answering"
MIT License

Inconsistent Results on EgoSchema #4

Closed Poeroz closed 4 months ago

Poeroz commented 4 months ago

Hi, thanks for your great work! I tried to reproduce your results on EgoSchema but found some inconsistencies. Specifically, I ran the standard prompt and the (C, Q) —> S prompt with the following commands:

standard prompt

python main.py --model gpt-3.5-turbo-1106 --output_base_path output/egoschema --output_filename standard_qa_1106.json

Results:
    "num_total": 500,
    "num_valids": 453,
    "num_corrects": 266,
    "acc": 0.532,

(C, Q) —> S prompt

python main.py --model gpt-3.5-turbo-1106 --task sum --prompt_type sum_q --num_words_in_sum 500 --temperature 1.0 --output_base_path output/egoschema --output_filename sum_q_500_1106.json

python main.py --model gpt-3.5-turbo-1106 --prompt_type qa_sum --data_path output/egoschema/sum_q_500_1106_data.json --output_base_path output/egoschema --output_filename qa_sum_q_500_1106.json

Results:
    "num_total": 500,
    "num_valids": 493,
    "num_corrects": 278,
    "acc": 0.556,

However, these results differ from the ones reported in the README:

LaViLa  gpt-3.5-turbo-1106  standard    55.2
LaViLa  gpt-3.5-turbo-1106  (C, Q) —> S 58.8

I have not modified any code and used the captions you released. Are there any possible reasons for the inconsistency? I also noticed that the results in the README are slightly different from those in the paper. Could you please tell me the reason behind this? Thank you!

Best regards

CeeZh commented 4 months ago

Hi Qingkai,

Thanks for reaching out!

The short answer is that the GPT models got updated. Even models with the same name (e.g. gpt-3.5-turbo-1106) keep changing.

The GPT models are silently updated. They are trained to avoid answering sensitive questions, and I believe recent updates have made GPT more conservative, i.e. it refuses to answer questions it is not sure about. That might be why your "num_valids" is much lower than "num_total". Our output files (https://drive.google.com/file/d/1d7a-FuQzdfQ7ZAzU5Y8HJpog1gm_sye_/view?usp=drive_link) were generated 3 months ago; if you check our standard_qa_1106.json, the "num_valids" is 500. Our metric is strict because it counts invalid examples as wrong. However, it also makes sense to treat invalid examples as random guesses. With random guessing, the standard prompt accuracy would be (0.532 × 500 + 0.2 × (500 − 453)) / 500 ≈ 55.1%, which is already very close to 55.2%.
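For reference, a minimal sketch of that random-guess adjustment (this helper is not part of the repo; it just assumes five answer choices per EgoSchema question, which is where the 0.2 comes from):

def adjusted_accuracy(num_total, num_valids, num_corrects, num_choices=5):
    # Count invalid (unanswered) examples as random guesses instead of wrong.
    expected_correct = num_corrects + (num_total - num_valids) / num_choices
    return expected_correct / num_total

# Standard prompt numbers from the reproduction above:
print(adjusted_accuracy(500, 453, 266))  # ~0.551, close to the reported 55.2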

For the (C, Q) —> S prompt, I think the GPT model update is the main reason. Also, we used "--temperature 1.0" so there might be some randomness.

In our paper we used gpt-3.5-turbo-0613, because most of our experiments were finished before gpt-3.5-turbo-1106 was released. That is why the numbers in the paper differ from those in the repo.

Hope this email answers your questions!

Best, Ce

Poeroz commented 4 months ago

Hi Ce,

Thanks for your quick response! I now understand the reason for the inconsistent results. Thank you again for your great work!

Best regards, Qingkai