allenai / open-instruct

Cannot reproduce the MMLU performance with our sft-ed tulu-2-7b model #148

Closed zhichaoxu-shufe closed 2 months ago

zhichaoxu-shufe commented 2 months ago

Hi,

We are trying to reproduce the SFT-ed 7B model. We use the same setup as open_instruct/finetune_trainer.py, with the exact hyperparameters reported in your paper, but we get a zero-shot MMLU result of 46.9% compared to your reported 50.4%. We use the evaluation script in this repo as well.

Can you help us interpret this performance difference? Might it be due to the random seed (ordering of the SFT data), or have you seen other issues cause this in your experience?

Thanks

hamishivi commented 2 months ago

Hi! There are two things that can significantly affect final model performance:

  1. Make sure your environment matches the package versions we pin in our requirements.txt. We found some issues with accelerate and transformers during training, and pinned versions (or even made forks) to avoid them. In particular, we found a left-padding issue with Llama that I am not sure is fixed in newer versions (see the first sketch after this list). Internally, when someone ran evals without the pinned packages, they got MMLU scores as low as 43.6 and saw a lot of variance during training in general.
  2. Batch size / gradient accumulation can make a difference in performance, especially for metrics like AlpacaEval. This is because the loss is averaged across tokens within a minibatch, so longer examples get more weight. However, since gradients are averaged across accumulation steps, gradient accumulation weights each minibatch equally (e.g., grad acc=4 with bsz=32 gives a different effective sample weighting than bsz=128). For our official Tulu 2 models I believe we used 4 gradient accumulation steps. If you want to weight every token equally, you can try setting reduce_loss to sum in our finetune script (see https://github.com/allenai/open-instruct/blob/main/open_instruct/finetune.py#L777); the second sketch below illustrates the difference.
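
For context on the padding point above, here is a minimal sketch of the left-padding setup that the pinned versions are meant to get right for batched Llama generation. The model name and the specific checks are illustrative assumptions, not code from this repo:

```python
from transformers import AutoTokenizer

# Illustrative assumption: Llama-family tokenizers ship without a pad token,
# and batched generation needs left padding; some transformers versions
# mishandled this, which is one reason the repo pins package versions.
tokenizer = AutoTokenizer.from_pretrained("allenai/tulu-2-7b")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
tokenizer.padding_side = "left"  # right padding can corrupt batched Llama generation
print(tokenizer.padding_side, tokenizer.pad_token)
```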
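
And here is a toy numeric sketch of why mean-reduced loss plus gradient accumulation weights tokens differently from sum reduction. The numbers are made up for illustration; this is not the repo's actual training loop:

```python
import torch

# Two microbatches from one gradient-accumulation step:
# microbatch A has 10 non-padding tokens, microbatch B has 1000.
per_token_loss_a = torch.full((10,), 2.0)
per_token_loss_b = torch.full((1000,), 1.0)

# reduce_loss="mean": each microbatch is averaged over its own tokens first,
# then accumulation averages the microbatches, so the 10 tokens in A carry
# as much total weight as the 1000 tokens in B.
mean_style = (per_token_loss_a.mean() + per_token_loss_b.mean()) / 2
print(mean_style)  # tensor(1.5000)

# reduce_loss="sum": losses are summed, so every token contributes equally
# (normalized here by total token count just for comparison) and B dominates.
sum_style = (per_token_loss_a.sum() + per_token_loss_b.sum()) / (10 + 1000)
print(sum_style)  # tensor(1.0099)
```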

I hope these details help!

t-li commented 2 months ago

Hi @hamishivi @yizhongw

Thanks a lot for the details. We are trying more of the configurations you suggested.

We also want to double-check a detail about the infrastructure. The appendix of the Camels paper has the line "All models except QLoRA models were trained on a 256-chip (512-chip for 70B DPO training) TPU v3 pod..." Does this mean all models, including the SFT ones, were trained on TPUs, or only the preference-tuned models?

Thanks in advance.

cc @zhichaoxu-shufe

hamishivi commented 2 months ago

Hi - sorry, I missed this comment. Correct: for Tulu 2 we used TPUs for all models except the QLoRA ones. I did reproduce the 7B model internally using this repository, though. The TPU codebase we used is less well documented, but it is available here: https://github.com/hamishivi/EasyLM