I'm not able to match the 3-shot eval results reported in the paper for the pretrained model.
I downloaded the Meditron-7b model from HF.
For example, on MedQA I get 0.353, while the paper reports 0.287±0.008.
My command was: ./inference_pipeline.sh -b medqa4 -c meditron-7b -s 3 -m 0 -out_dir out_dir
On PubMedQA, I got 0.486, but the paper reports 0.693±0.151.
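To make the gap concrete, here is a quick sketch (using only the numbers above, and treating the paper's ± value as a simple interval around the mean) showing that both of my scores fall well outside the reported ranges:

```python
# Reported (mean, ±interval) from the paper vs. my observed 3-shot scores.
reported = {
    "medqa4":   (0.287, 0.008),
    "pubmedqa": (0.693, 0.151),
}
observed = {"medqa4": 0.353, "pubmedqa": 0.486}

for bench, (mean, delta) in reported.items():
    lo, hi = mean - delta, mean + delta
    obs = observed[bench]
    within = lo <= obs <= hi
    print(f"{bench}: observed {obs:.3f}, "
          f"reported {mean:.3f}±{delta:.3f} -> within interval: {within}")
```

Both checks print `within interval: False`, so this doesn't look like run-to-run noise.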