Closed by zhichaoxu-shufe 2 months ago
Hi! There are a few things that can significantly affect final model performance; in particular, we set `reduce_loss` to `sum` in our finetune script (see https://github.com/allenai/open-instruct/blob/main/open_instruct/finetune.py#L777). I hope these details help!
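As a rough illustration of why this setting matters: with `mean` reduction, per-token losses are averaged over non-padding tokens, while `sum` simply adds them up, so longer examples contribute proportionally more to each gradient update. A minimal NumPy sketch (the function name, mask convention, and toy numbers are ours for illustration, not code from the repo):

```python
import numpy as np

def batch_loss(per_token_losses, mask, reduction="mean"):
    """Combine per-token cross-entropy losses into a scalar batch loss.

    per_token_losses: (batch, seq_len) array of loss values
    mask: (batch, seq_len) array, 1 for real tokens, 0 for padding
    """
    masked = per_token_losses * mask
    if reduction == "mean":
        # average over all non-padding tokens in the batch
        return masked.sum() / mask.sum()
    elif reduction == "sum":
        # sum over tokens: longer sequences weigh more in the update
        return masked.sum()
    raise ValueError(f"unknown reduction: {reduction}")

# toy batch: example 0 has 2 real tokens, example 1 has 4
losses = np.array([[1.0, 1.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0]])
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 1]])

print(batch_loss(losses, mask, "mean"))  # 1.0
print(batch_loss(losses, mask, "sum"))   # 6.0
```

With `mean`, the two examples above produce the same loss per token regardless of length; with `sum`, the 4-token example contributes twice as much to the total as the 2-token one.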
Hi @hamishivi @yizhongw
Thanks a lot for the details. We are trying on more specs as you suggested.
We want to double-check the details about the infra. The Camels paper appendix says: "All models except QLoRA models were trained on a 256-chip (512-chip for 70B DPO training) TPU v3 pod..." Does this mean all models, including the SFT ones, were trained on TPUs, or only the preference-based models?
Thanks in advance.
cc @zhichaoxu-shufe
Hi - sorry, missed this comment. Correct, for Tulu 2 we used TPUs for all models except QLoRA. I did reproduce the 7b model internally using this repository, though. The TPU codebase we used is less well-documented but available here: https://github.com/hamishivi/EasyLM
Hi,
We are trying to reproduce the SFT-ed 7B model. We use the same setup as in open_instruct/finetune_trainer.py, with the exact hyperparameters reported in your paper, but we get a zero-shot MMLU result of 46.9%, compared to your reported 50.4%. We use the evaluation script in this repo as well.
Can you help us interpret this performance difference? Could it be due to the random seed (i.e., the ordering of the SFT data), or to other issues, in your experience?
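For context on the seed question: the seed determines, among other things, the order in which SFT examples are visited, and that ordering is deterministic given the seed. A tiny sketch (the helper name is ours, purely illustrative of the general mechanism, not of this repo's data loader):

```python
import random

def shuffled_order(n_examples, seed):
    # hypothetical helper: the order in which SFT examples would be seen
    # for a given seed; same seed -> same ordering, different seed -> a
    # (generally) different permutation of the same examples
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    return idx

print(shuffled_order(8, seed=0))
print(shuffled_order(8, seed=1))
```

Running SFT with a few different seeds and comparing the spread of MMLU scores is one way to check whether a ~3-point gap is within seed-to-seed noise.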
Thanks