allenai / open-instruct


llama reproduce issue #179

Closed jordane95 closed 1 week ago

jordane95 commented 1 week ago

Hi,

I noticed that Tulu v1 did an ablation over different instruction datasets, including FLAN v2, which is used in the original LLaMA paper to train an instruct model. However, I find that the MMLU 5-shot result of the 65B LLaMA model after instruction tuning in your Tulu v1 paper does not match the number reported in the LLaMA paper: the LLaMA paper reports a ~5 point improvement, while Table 8 in the Tulu v1 paper shows nearly no effect, if not a negative one, after instruction tuning. What might be the reason?

natolambert commented 1 week ago

@jordane95 it could be a difference in evaluation frameworks between open-instruct and the Llama team. This happens a lot :)

jordane95 commented 1 week ago

> @jordane95 it could be a difference in evaluation frameworks between open-instruct and the Llama team. This happens a lot :)

But you successfully reproduced the eval results of the base model...

natolambert commented 1 week ago

@jordane95 what do you mean? Can you be more specific? I think I misunderstood your question.

jordane95 commented 1 week ago

> @jordane95 what do you mean? Can you be more specific? I think I misunderstood your question.

Let's look at the table below

| MMLU 5-shot | LLaMA 65B | LLaMA 65B-instruct |
|-------------|-----------|--------------------|
| LLaMA paper | 63.4      | 68.9               |
| Tulu paper  | 63.3      | 61.4               |
| diff        | 0.1       | 7.5                |

Looking at the diff, I suspect it might not be the eval framework, since the base model performs nearly the same but there is a large gap in the instruct model performance.

natolambert commented 1 week ago

These are differences in eval setup, e.g. 0-shot vs. 5-shot, and there are other settings too. We've seen this before :)

Llama paper

[screenshot: MMLU results table from the Llama paper]

Tulu 1 paper

[screenshot: MMLU results table from the Tulu 1 paper]

There's also a difference between EM (exact match over generated answers) and MC (multiple-choice scoring via answer log-likelihoods). See this blog post for more details: https://huggingface.co/blog/open-llm-leaderboard-mmlu
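
The gap between the two scoring styles roughly comes down to how an answer is extracted from the model. A minimal sketch, with `generate_answer` and `log_likelihood` as hypothetical stand-ins for a harness's internals rather than open-instruct's actual API:

```python
# Two common ways to score an MMLU question; small choices like letter
# extraction, length normalization, and number of shots in the prompt can
# move absolute scores by several points.

CHOICES = ["A", "B", "C", "D"]

def score_em(model, prompt, gold_letter, generate_answer):
    """Exact-match / generation style: generate text and check the first letter."""
    prediction = generate_answer(model, prompt).strip()
    return float(prediction[:1].upper() == gold_letter)

def score_mc(model, prompt, gold_letter, log_likelihood):
    """Multiple-choice style: score each candidate letter as a continuation
    and pick the most likely one."""
    scores = {c: log_likelihood(model, prompt, f" {c}") for c in CHOICES}
    predicted = max(scores, key=scores.get)
    return float(predicted == gold_letter)
```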

hamishivi commented 2 days ago

Hi, just to add some other notes: the llama-instruct model in the Llama 1 paper is trained purely on FLAN data, which is useful for MMLU, while for Tulu v1 the 65B model only saw ~100k FLAN samples (vs. the full dataset size, which is something like 2 million datapoints). This may be why llama-instruct achieves a much stronger MMLU score: it is trained on much more of FLAN! Additionally, the Tulu v2 mixture itself is much smaller than FLAN (~400k samples vs. ~2 million).
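
To put the scale difference in perspective, a back-of-the-envelope sketch (the counts below are the rough figures mentioned above, not exact dataset sizes, and `subsample` is just an illustrative helper):

```python
import random

FULL_FLAN_V2 = 2_000_000     # roughly the full FLAN v2 dataset
FLAN_IN_TULU_V1 = 100_000    # roughly the FLAN samples in the Tulu v1 mixture

def subsample(population, k, seed=42):
    """Draw k examples without replacement, as a mixture-building step might."""
    rng = random.Random(seed)
    return rng.sample(population, k)

subset_ids = subsample(range(FULL_FLAN_V2), FLAN_IN_TULU_V1)
print(f"kept {len(subset_ids):,} of {FULL_FLAN_V2:,} FLAN examples "
      f"(~{len(subset_ids) / FULL_FLAN_V2:.0%})")  # ~5%
```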

We did use the same evaluation setup, as best we could, although we cannot rule out the possibility of small differences. One thing we did change is that we used 8-bit quantization during MMLU evaluation for Tulu v1, which may degrade scores slightly. However, I think the big shift from training only on FLAN to a mixture containing only a small amount of it likely explains this discrepancy.
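
For reference, 8-bit loading for evaluation generally looks something like the sketch below with transformers + bitsandbytes; the checkpoint name is a placeholder, and these may not be the exact flags our eval scripts used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "huggyllama/llama-65b"  # placeholder; swap in the model being evaluated

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",          # shard across available GPUs
    torch_dtype=torch.float16,  # keep non-quantized modules in fp16
)
model.eval()  # inference only; 8-bit weights can shift logits slightly vs. fp16
```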