Aiden0526 / SymbCoT

Codes and Data for ACL 2024 Paper "Faithful Logical Reasoning via Symbolic Chain-of-Thought".
MIT License

Discrepancy in Evaluation Results for ProntoQA with gpt-3.5-turbo #3

Open yigengjiang opened 1 week ago

yigengjiang commented 1 week ago

Description

I encountered an issue in the code when running the evaluation script. Below are the details of the issue and the steps I took to investigate and attempt a fix.

Steps to Reproduce

When I execute the following evaluate.sh script:

python evaluate.py \
   --dataset_name "ProntoQA" \
   --model_name "gpt-3.5-turbo" \
   --split dev \
   --verification "True"

Outputs

The script outputs:

Result file:  ./verified_results/None_ProntoQA_dev_gpt-3.5-turbo_verified.json
Total records: 104
Correctly predicted 'true': 0
Correctly predicted 'false': 104
Accuracy: 0.00%

There are only 104 records because my gpt-3.5-turbo budget ran out during logical inference, so the run did not cover the full dev split.

Attempted Fix

I modified the following line in evaluate.py:

predicted_answer = record.get('predicted_answer', '').strip().lower()  # original
predicted_answer = record.get('predicted_choice', '').strip().lower()  # after modification
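
As an aside, a more defensive version of this line could fall back to whichever key is present. This is just a sketch: only record and the two key names come from the line above; the rest is my guess at the script's structure.

# Accept either key, since the verified results appear to store the
# prediction under 'predicted_choice' rather than 'predicted_answer'.
predicted_answer = (
    record.get('predicted_choice')     # key actually present in my result file
    or record.get('predicted_answer')  # key the original script expects
    or ''
).strip().lower()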

After modifying the code, I executed evaluate.sh again, resulting in:

Result file:  ./verified_results/None_ProntoQA_dev_gpt-3.5-turbo_verified.json
Total records: 104
Correctly predicted 'true': 25
Correctly predicted 'false': 79
Accuracy: 24.04%

Even after the modification, the accuracy (24.04%) remains far below the result reported in the paper (75.8%). Even allowing for the smaller sample of 104 records, the discrepancy seems too large to be explained by sample size alone.
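
As a rough back-of-the-envelope check (my own arithmetic, not a figure from the paper): if the true accuracy were 75.8%, the standard error over 104 samples would be sqrt(0.758 × 0.242 / 104) ≈ 4.2 percentage points, so an observed 24.04% sits more than 12 standard errors below the reported number.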

Request

Could you please investigate this issue further? It seems there might be an underlying problem affecting the evaluation accuracy.

Thank you for your assistance.

Aiden0526 commented 1 week ago

Hi Yigengjiang,

Thank you for spotting the error and sorry for any inconvenience.

I checked the code and found a typo in the prompt construction that was causing this issue. It is now fixed. I re-ran the first 30 instances and got an accuracy of 70%, so please try again; you should see a similar result (see the quick spot-check sketch below).
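
For reference, here is a minimal sketch for spot-checking the first 30 records. The file path is taken from your output above; I am assuming the file is a JSON list of records, and the gold-label key 'answer' is a hypothetical placeholder, so adjust both to the actual format.

import json

# Spot-check accuracy on the first 30 records of the verified results.
# 'predicted_choice' matches the fixed evaluate.py line; 'answer' as the
# gold-label key is a hypothetical placeholder.
with open('./verified_results/None_ProntoQA_dev_gpt-3.5-turbo_verified.json') as f:
    records = json.load(f)[:30]

correct = sum(
    r.get('predicted_choice', '').strip().lower() == r.get('answer', '').strip().lower()
    for r in records
)
print(f"Accuracy on first {len(records)} records: {correct / len(records):.2%}")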

Thanks.

yigengjiang commented 1 week ago

Thank you for your response! I have another question regarding the ablation study. Could you clarify why there wasn't an ablation study conducted with and without the translator module? It seems that the translator plays a pivotal role in your method.

Aiden0526 commented 1 week ago

Hi Yigengjiang,

There is an ablation with and without the translator. As mentioned in Section 4.3 (model ablation), the translator contributes an average improvement of 6.3%. In Figure 3 of the paper, the translator's contribution is computed as the number on the grey bar (SymbCoT without planner, solver, and verifier, i.e., the translator alone) minus the number on the pink bar (SymbCoT without translator, planner, solver, and verifier, i.e., no modules at all).
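
In compact form (my restatement of the computation above):

contribution(translator) = Acc(grey bar: translator only) − Acc(pink bar: no modules) ≈ 6.3%, averaged over the evaluated datasets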

Thanks.