epfLLM / meditron

Meditron is a suite of open-source medical Large Language Models (LLMs).
https://huggingface.co/epfl-llm
Apache License 2.0

Abnormal evaluation result #23

Closed zhhvvv closed 5 months ago

zhhvvv commented 6 months ago

I evaluated the llama-2-70b model on pubmedqa with the cot, sc_cot, and multi_seed + sc_cot inference modes, but I got some abnormal evaluation results.

For the cot inference mode, only 26 answers were counted (16 of them correct) and 474 prompts were ignored; is that normal? For the sc_cot and multi_seed + sc_cot runs, I got about 52% accuracy, which differs from the result in your paper.
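
For context, my understanding of sc_cot is that it samples several chain-of-thought generations and majority-votes the parsed answers. The sketch below is only my illustration of that idea, not code from this repo; `generate_cot_answer` is a hypothetical placeholder.

```python
from collections import Counter

def sc_cot_answer(generate_cot_answer, question, num_samples=5):
    """Sample several CoT generations and majority-vote the parsed answers."""
    answers = []
    for _ in range(num_samples):
        # Each call samples one reasoning chain and parses out a final answer
        # (e.g. "yes" / "no" / "maybe" for pubmedqa); parsing can fail.
        answer = generate_cot_answer(question)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None  # would count toward "Unable to find answer" / "Ignored prompts"
    return Counter(answers).most_common(1)[0][0]
```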

I want to know whether the released evaluation code is exactly the same as the code you used.

My evaluation results:

cot:

====================================
Report accuracy for pubmedqa-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.032
Accuracy (calibrated): 0.6153846153846154
Precision: 0.03709090909090909
Recall: 0.032
F1: 0.033303703703703696
------------------------------------
Correct: 16
Counted: 26
Total: 500
Unable to find answer: 474
Ignored prompts: 474
====================================

sc_cot

====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================

Multi-seed + sc_cot:

====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-1234:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-432:
Accuracy: nan
Accuracy (calibrated): -1
Precision: nan
Recall: nan
F1: nan
------------------------------------
Correct: 0
Counted: 0
Total: 0
Unable to find answer: 0
Ignored prompts: 0
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-32:
Accuracy: nan
Accuracy (calibrated): -1
Precision: nan
Recall: nan
F1: nan
------------------------------------
Correct: 0
Counted: 0
Total: 0
Unable to find answer: 0
Ignored prompts: 0
====================================
eric11eca commented 6 months ago

Hi there! Did you fine-tune your model on the associated CoT data for the evaluation task, or is this zero-shot CoT with the raw pre-trained model?

zhhvvv commented 6 months ago

Thanks for the reply! These results were obtained with the pre-trained llama2-70b-base-hf model using zero-shot cot and zero-shot sc_cot and your evaluation code.

eric11eca commented 6 months ago

Ah, I see. The results we reported in the paper are from a fine-tuned version of the pre-trained models. The evaluation code in our repo is designed for a specific data format we used for fine-tuning.

It is generally difficult to control the answer format of the model when doing zero-shot CoT unless it has been tuned on instruction data or with DPO. I would recommend trying a model like allenai/tulu-2-dpo-70b and modifying the evaluation code to accommodate its CoT output format.
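
As a rough illustration of what I mean by accommodating the output format, something along these lines is usually enough; this is a hypothetical sketch, not the parser in this repo, and the pattern should be adjusted to however the model actually phrases its final answer:

```python
import re

# Relaxed pattern for a pubmedqa-style final answer; tune it to the phrasing
# the chosen model produces in its CoT output.
ANSWER_RE = re.compile(
    r"(?:final answer|the answer is)\s*[:\-]?\s*\(?\s*(yes|no|maybe)\b",
    re.IGNORECASE,
)

def extract_answer(generation: str):
    """Return 'yes' / 'no' / 'maybe' if a final answer is found, else None."""
    match = ANSWER_RE.search(generation)
    return match.group(1).lower() if match else None
```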

Hope this helps!

zhhvvv commented 6 months ago

Thank you! So the llama2 model used in the Top Token Selection evaluation (Table 5 of the paper) has also been fine-tuned for these downstream tasks?

eric11eca commented 6 months ago

Exactly, you can see more details about the setup in our paper (Section 6.2, Setup).

zhhvvv commented 6 months ago

Thanks. Then, what few-shot strategy does the top token selection method in Table 5 use (3 shots for the 7B models and 5 shots for the 70B models, or all zero-shot)? And which seed did you choose for top token selection, or did you average the results over the three seed experiments (seed = [1234, 432, 32])? Finally, is the divisor for the accuracy in the table the total number of cases or the number of counted cases? (That is, are cases where the answer cannot be parsed included in the accuracy calculation?)
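
For reference, the numbers in my cot report above would be consistent with the reading below, but this is just my guess from the outputs, not something I verified in your code:

```python
def report_accuracies(correct: int, counted: int, total: int):
    # Assumed definitions: "Accuracy" divides by all questions, while
    # "Accuracy (calibrated)" divides by the prompts whose answer was parsed.
    accuracy = correct / total if total else float("nan")
    accuracy_calibrated = correct / counted if counted else -1
    return accuracy, accuracy_calibrated

# The cot report is consistent with this: 16 / 500 = 0.032 and 16 / 26 ≈ 0.615.
print(report_accuracies(correct=16, counted=26, total=500))
```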

eric11eca commented 6 months ago

Hi there! I will address each of your questions below:

Hope these can answer your questions. Best!