Jiaxin-Wen opened this issue 4 months ago
Moreover, I find that simply fine-tuning SOTA LMs (e.g., llama-3-8b) on the original training set of GSM8K does not lead to any improvement compared with few-shot performance.
| llama-3-8b | GSM8K |
|---|---|
| few-shot prompting | 55.57 |
| fine-tuning | 55.79 |
I would like to know if this aligns with your experimental results. If so, could you please share the data you used for continued instruction tuning as well? That would be really helpful for reproducing the results in the paper.
Thanks! We used few-shot prompting for the evaluation. You can find more details in our evaluation code and in the implementation details in the paper.
Do you add an EOS token during pre-training or continued fine-tuning?
Since all the training data is in a one-shot format, I'm wondering whether I should remove the EOS token during pre-training or fine-tuning so the model adapts to few-shot evaluation.
Or is there some other trick you used to adapt the model to few-shot evaluation?
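To make the question concrete, here is a minimal sketch (plain Python with a stand-in `encode()` and a hypothetical `EOS_ID`; not the repo's actual code) of the mismatch I mean: each one-shot training example terminates with EOS, while a few-shot eval prompt concatenates demonstrations with only a text separator, so the model never sees EOS inside the prompt.

```python
EOS_ID = 2  # hypothetical EOS token id, for illustration only

def encode(text):
    # stand-in tokenizer: one "token" per character (printable chars never collide with EOS_ID)
    return [ord(c) for c in text]

def training_example(question, answer):
    # one-shot fine-tuning format: each (question, answer) pair ends with EOS
    return encode(question + "\n" + answer) + [EOS_ID]

def few_shot_prompt(examples, new_question):
    # few-shot evaluation: demonstrations joined by a plain separator, no EOS in between
    demos = "\n\n".join(q + "\n" + a for q, a in examples)
    return encode(demos + "\n\n" + new_question + "\n")

train_ids = training_example("Q: 2+2?", "A: 4")
prompt_ids = few_shot_prompt([("Q: 2+2?", "A: 4")], "Q: 3+3?")
assert train_ids[-1] == EOS_ID   # training sequence terminates with EOS
assert EOS_ID not in prompt_ids  # few-shot prompt contains no EOS
```

If training always ends each example with EOS, I'd expect the model to emit EOS right after the first answer at few-shot eval time, which is why I'm asking how you handled it.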
In Section 2.5, models are further fine-tuned on several open-source instruction-tuning datasets, which include the training sets of GSM8K and MATH.
I'm wondering whether, after this continued fine-tuning, models are still evaluated with few-shot prompting or with zero-shot prompting. For example, if the model is fine-tuned on GSM8K with the following data format: