Closed zhhvvv closed 5 months ago
Hi there! Did you fine-tune your model on the associated CoT data for the evaluation task, or are these results from zero-shot CoT with the pretrained raw model?
Thanks for the reply! These results were obtained from the pre-trained llama2-70b-base-hf model using zero-shot CoT and zero-shot SC-CoT with your evaluation code.
Ah, I see. The results we reported in the paper are from a fine-tuned version of the pre-trained models. The evaluation code in our repo is designed for a specific data format we used for fine-tuning.
It is generally difficult to control the answer format of a model with zero-shot CoT unless it has been tuned on instruction data or via DPO. I would recommend trying a model like allenai/tulu-2-dpo-70b and modifying the evaluation code to accommodate its CoT output format.
Hope this helps!
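For reference, accommodating a model's CoT output format usually comes down to a small answer parser. Here is a minimal sketch; the "the answer is X" pattern and the yes/no/maybe label set are assumptions about a PubMedQA-style setup, not the repo's actual format:

```python
import re

def parse_cot_answer(completion: str, labels=("yes", "no", "maybe")):
    """Extract the final answer from a free-form CoT completion.

    Returns one of `labels`, or None if no answer can be parsed
    (such cases would be counted as "ignored" by an evaluator).
    """
    match = re.search(r"answer is[:\s]*([A-Za-z]+)", completion, re.IGNORECASE)
    if match:
        candidate = match.group(1).lower()
        if candidate in labels:
            return candidate
    return None

print(parse_cot_answer("Let's think step by step... so the answer is Yes."))  # yes
print(parse_cot_answer("It is unclear."))  # None
```

Swapping in whatever phrasing your chosen model actually produces (e.g. "Final answer:") is the part that needs tuning per model.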
Thank you! So the llama2 model used in the Top Token Selection evaluation (Table 5 of the paper) has also been fine-tuned for these downstream tasks?
Exactly, you can see more details about the setup in our paper (Section 6.2, Setup).
Thanks. A few follow-up questions about the top token selection method in Table 5:
- What few-shot strategy was used (3 shots for 7B models and 5 shots for 70B models, or all 0-shot)?
- Which seed did you choose for top token selection, or did you average the results over three seed runs (seed = [1234, 432, 32])?
- Is the divisor for computing the accuracy in the table the total number of cases, or only the counted cases? (Are cases where the answer cannot be parsed included in the accuracy calculation?)
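For context on the seed question, a sketch of the two common ways of combining repeated runs: self-consistency majority-votes sampled answers within one run, while multi-seed averages accuracy across runs. Function names are mine, not the repo's:

```python
from collections import Counter
from statistics import mean

def sc_vote(parsed_answers):
    """Self-consistency: majority-vote the parsed answers from several
    sampled CoT completions for one question (ties broken by Counter order).
    Unparseable completions (None) are dropped before voting."""
    votes = [a for a in parsed_answers if a is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

def multi_seed_accuracy(per_seed_accuracies):
    """Multi-seed: run the whole evaluation once per seed, report the mean."""
    return mean(per_seed_accuracies)

print(sc_vote(["yes", "no", "yes", None]))  # yes
print(multi_seed_accuracy([0.52, 0.55, 0.53]))
```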
Hi there! I will address each of your questions below:
Hope this answers your questions. Best!
I evaluated the llama-2-70b model on PubMedQA with the cot, sc_cot, and multi_seed + sc_cot inference modes, but I got some abnormal evaluation results.
For the cot inference mode, there are only 26 correct answers with 476 ignored; is that normal? For the sc_cot and multi_seed + sc_cot modes, I got about 52% accuracy, which differs from the result in your paper.
I want to know whether the evaluation code in the repo is exactly the same as the one you used.
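The large "ignored" count matters because the two accuracy conventions differ only in the divisor. A quick illustration with placeholder numbers (not the counts from any actual run):

```python
def accuracy(num_correct, num_total, num_ignored, over_parsed_only=False):
    """Accuracy over all cases, or only over cases whose answers parsed.

    over_parsed_only=False: divisor is the full dataset size, so every
    unparseable ("ignored") case counts as wrong.
    over_parsed_only=True: ignored cases are excluded from the divisor.
    """
    divisor = num_total - num_ignored if over_parsed_only else num_total
    return num_correct / divisor if divisor else 0.0

# Placeholder numbers: 40 correct, 100 ignored, 500 cases in total.
print(accuracy(40, 500, 100))                         # 0.08
print(accuracy(40, 500, 100, over_parsed_only=True))  # 0.1
```

With hundreds of ignored cases, the two conventions give wildly different numbers, which is why the divisor question above is worth pinning down.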
My evaluation results:
- cot:
- sc_cot:
- multi_seed + sc_cot: