Closed: jordane95 closed this issue 1 week ago
@jordane95 it could be that the evaluation frameworks differ between open-instruct and the Llama team. This happens a lot :)
But you successfully reproduced the eval results of the base model...
@jordane95 what do you mean? Can you be more specific? I think I misunderstood your question.
Let's look at the table below:

| MMLU 5-shot | llama65b | llama65b-instruct |
|---|---|---|
| Llama paper | 63.4 | 68.9 |
| Tulu paper | 63.3 | 61.4 |
| diff | 0.1 | 7.5 |
Looking at the diff, I suspect it is not the eval framework: the base model scores nearly the same in both papers, but there is a large gap in the instruct model's performance.
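For concreteness, the diffs in the table can be recomputed from the reported scores (a minimal sketch; the numbers are copied from the table above):

```python
# MMLU 5-shot scores as reported in the two papers (from the table above).
llama_paper = {"llama65b": 63.4, "llama65b-instruct": 68.9}
tulu_paper = {"llama65b": 63.3, "llama65b-instruct": 61.4}

# Per-model gap between the two papers' reported numbers.
diffs = {m: round(llama_paper[m] - tulu_paper[m], 1) for m in llama_paper}
print(diffs)  # {'llama65b': 0.1, 'llama65b-instruct': 7.5}
```

The base model agrees to within 0.1 points while the instruct model differs by 7.5, which is what suggests the gap is not purely a framework artifact.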
These are differences in eval setup, e.g. 0-shot vs. 5-shot, among other settings. We've seen this before :)
Llama paper
Tulu 1 paper
There's a difference between EM and MC scoring. See this blog post for more details: https://huggingface.co/blog/open-llm-leaderboard-mmlu
Hi, just to add some other notes: the llama-instruct model in the Llama 1 paper is trained purely on FLAN data, which is useful for MMLU, while for Tulu v1, the 65B model only saw ~100k FLAN samples (vs the full dataset size, which is something like 2 million datapoints). This may be why llama-instruct achieves much stronger MMLU - it is trained on much more of FLAN! Additionally, the tulu v2 mixture itself is much smaller than flan (~400k samples vs ~2 million).
We did use the same evaluation setup, as best we could, although we cannot rule out the possibility of small changes. One thing we did change is that we used 8-bit quantization during MMLU evaluation for Tulu v1, which may degrade scores slightly. However, I think the big shift from training only on FLAN to a small mixture of it likely explains this discrepancy.
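To see why 8-bit quantization can nudge scores, here is a minimal sketch of absmax int8 quantization (one simple scheme for illustration; not necessarily the exact method used in the open-instruct eval):

```python
def quantize_int8(xs):
    # Absmax quantization: scale so the largest-magnitude weight maps to 127.
    scale = max(abs(x) for x in xs) / 127.0
    q = [round(x / scale) for x in xs]  # int8-range values
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights; rounding error remains.
    return [v * scale for v in q]

weights = [0.1234, -0.5678, 0.9012, -0.0005]  # toy weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(max(errors))  # small but nonzero rounding error per weight
```

Each weight picks up a small rounding error; accumulated over a full model's matmuls, this can shift logits just enough to flip a few multiple-choice answers and move the headline score by a fraction of a point.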
Hi,
I noticed that Tulu v1 ran an ablation over different instruction datasets, including FLAN v2, which the original Llama paper used to train an instruct model. However, the MMLU 5-shot result for the 65B Llama model after instruction tuning in the Tulu v1 paper does not match the number reported in the Llama paper. The Llama paper reported a roughly 5-point improvement, but Table 8 in the Tulu v1 paper shows nearly no (if not a negative) effect after instruction tuning. What might be the reason?