epfLLM / meditron

Meditron is a suite of open-source medical Large Language Models (LLMs).
https://huggingface.co/epfl-llm
Apache License 2.0

Abnormal evaluation result #23

Closed zhhvvv closed 5 months ago

zhhvvv commented 6 months ago

I evaluated the llama-2-70b model on pubmedqa with the cot, sc_cot, and multi_seed + sc_cot inference modes, but I got some abnormal evaluation results.

For the cot inference mode, only 26 answers were counted (16 of them correct) and 474 prompts were ignored; is that normal? For the sc_cot and multi_seed + sc_cot runs, I got about 52% accuracy, which differs from the result in your paper.
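
For context, my understanding of sc_cot is that it samples several chain-of-thought generations and majority-votes the parsed answers. The sketch below is only my illustration of that idea, not code from this repo; `generate_cot_answer` is a hypothetical placeholder.

```python
from collections import Counter

def sc_cot_answer(generate_cot_answer, question, num_samples=5):
    """Sample several CoT generations and majority-vote the parsed answers."""
    answers = []
    for _ in range(num_samples):
        # Each call samples one reasoning chain and parses out a final answer
        # (e.g. "yes" / "no" / "maybe" for pubmedqa); parsing can fail.
        answer = generate_cot_answer(question)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None  # would count toward "Unable to find answer" / "Ignored prompts"
    return Counter(answers).most_common(1)[0][0]
```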

I want to know whether the released evaluation code is exactly the same as the code you used.

My evaluation results:

cot:

====================================
Report accuracy for pubmedqa-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.032
Accuracy (calibrated): 0.6153846153846154
Precision: 0.03709090909090909
Recall: 0.032
F1: 0.033303703703703696
------------------------------------
Correct: 16
Counted: 26
Total: 500
Unable to find answer: 474
Ignored prompts: 474
====================================

sc_cot

====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================

Multi-seed + sc_cot:

====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-1234:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-432:
Accuracy: nan
Accuracy (calibrated): -1
Precision: nan
Recall: nan
F1: nan
------------------------------------
Correct: 0
Counted: 0
Total: 0
Unable to find answer: 0
Ignored prompts: 0
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-32:
Accuracy: nan
Accuracy (calibrated): -1
Precision: nan
Recall: nan
F1: nan
------------------------------------
Correct: 0
Counted: 0
Total: 0
Unable to find answer: 0
Ignored prompts: 0
====================================
eric11eca commented 6 months ago

Hi there! Did you fine-tune your model on the associated CoT data for the evaluation task, or is this zero-shot CoT with the raw pre-trained model?

zhhvvv commented 6 months ago

Thanks for the reply! These results were obtained with the pre-trained llama2-70b-base-hf model using zero-shot cot and zero-shot sc_cot and your evaluation code.

eric11eca commented 6 months ago

Ah, I see. The results we reported in the paper are from a fine-tuned version of the pre-trained models. The evaluation code in our repo is designed for a specific data format we used for fine-tuning.

It is generally difficult to control the answer format of the model when doing zero-shot CoT unless it has been tuned on instruction data or with DPO. I would recommend trying a model like allenai/tulu-2-dpo-70b and modifying the evaluation code to accommodate its CoT output format.
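
As a rough illustration of what I mean by accommodating the output format, something along these lines is usually enough; this is a hypothetical sketch, not the parser in this repo, and the pattern should be adjusted to however the model actually phrases its final answer:

```python
import re

# Relaxed pattern for a pubmedqa-style final answer; tune it to the phrasing
# the chosen model produces in its CoT output.
ANSWER_RE = re.compile(
    r"(?:final answer|the answer is)\s*[:\-]?\s*\(?\s*(yes|no|maybe)\b",
    re.IGNORECASE,
)

def extract_answer(generation: str):
    """Return 'yes' / 'no' / 'maybe' if a final answer is found, else None."""
    match = ANSWER_RE.search(generation)
    return match.group(1).lower() if match else None
```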

Hope this helps!

zhhvvv commented 6 months ago

Thank you! So the llama2 model used in the Top Token Selection evaluation (Table 5 of the paper) has also been fine-tuned for these downstream tasks?

eric11eca commented 6 months ago

Exactly, you can see more details about the setup in our paper (Section 6.2, Setup).

zhhvvv commented 6 months ago

Thanks. Then, what few-shot strategy does the top token selection method in Table 5 use (3 shots for the 7B models and 5 shots for the 70B models, or all zero-shot)? And which seed did you choose for top token selection, or did you average the results over the three seed experiments (seed = [1234, 432, 32])? Finally, is the divisor for the accuracy in the table the total number of cases or the number of counted cases? (That is, are cases where the answer cannot be parsed included in the accuracy calculation?)
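
For reference, the numbers in my cot report above would be consistent with the reading below, but this is just my guess from the outputs, not something I verified in your code:

```python
def report_accuracies(correct: int, counted: int, total: int):
    # Assumed definitions: "Accuracy" divides by all questions, while
    # "Accuracy (calibrated)" divides by the prompts whose answer was parsed.
    accuracy = correct / total if total else float("nan")
    accuracy_calibrated = correct / counted if counted else -1
    return accuracy, accuracy_calibrated

# The cot report is consistent with this: 16 / 500 = 0.032 and 16 / 26 ≈ 0.615.
print(report_accuracies(correct=16, counted=26, total=500))
```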

eric11eca commented 6 months ago

Hi there! I will address each of your questions below:

Hope these can answer your questions. Best!