allenai / open-instruct

Apache License 2.0

Bump AlpacaEval to 0.6, add AlpacaEval 2 #139

Closed · hamishivi closed this 1 month ago

hamishivi commented 3 months ago

Recently, AlpacaEval 0.6 came out with length-controlled AlpacaEval 2. I think this is worth including in our evals now that we are saturating AlpacaEval 1. This PR adds AlpacaEval 2 as an explicit task in the eval script.
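For reference, the new task roughly boils down to a call into the `alpaca_eval` Python API. The sketch below is illustrative rather than the exact code in the eval script; the file paths are placeholders.

```python
# Sketch only (not the exact eval script): score a file of model generations with
# the alpaca_eval >= 0.6 package; paths below are placeholders.
from alpaca_eval import evaluate

df_leaderboard, annotations = evaluate(
    model_outputs="results/tulu-2-dpo-13b/alpaca_eval_outputs.json",  # placeholder path
    annotators_config="weighted_alpaca_eval_gpt4_turbo",  # AlpacaEval 2 annotator
    output_path="results/tulu-2-dpo-13b/alpaca_eval",     # placeholder path
    is_return_instead_of_print=True,
)
# alpaca_eval >= 0.6 reports a length-controlled win rate alongside the raw one.
print(df_leaderboard[["win_rate", "length_controlled_winrate"]])
```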

Old Tulu 7B DPO score: 85.09, new score: 84.45
Old Tulu 13B DPO score: 89.46, new score: 88.29

The scores are a little lower, but within noise (~1.5 points). There is also some nondeterminism with vLLM that might be at play here.
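For context on the vLLM point, the knobs that matter for run-to-run variation are roughly the ones below. This is just a sketch (the model name, seed, and token budget are placeholder values), and older vLLM versions can still vary across runs from batching/scheduling even with these set.

```python
# Sketch only: pin down decoding to reduce run-to-run variation in vLLM generations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="allenai/tulu-2-dpo-13b",  # example model, not necessarily what the script loads
    seed=42,                         # fixed engine seed (placeholder value)
)
greedy = SamplingParams(
    temperature=0.0,   # greedy decoding removes sampling randomness
    max_tokens=1024,   # placeholder generation budget
)
outputs = llm.generate(["<prompt goes here>"], greedy)
print(outputs[0].outputs[0].text)
```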

yizhongw commented 2 months ago

The PR LGTM. Feel free to merge, but we need to be aware of the performance inconsistency for our new experiments.

Regarding the performance drop, it is a bit concerning if we don't know the exact reason. I think vLLM should be quite deterministic, though. Is there any discussion about the nondeterminism?

hamishivi commented 2 months ago

Re: the nondeterminism, there isn't any particular discussion, but let me wait on merging this until I do one or two reruns to make sure. Jacob also mentioned that older vLLM versions have some nondeterminism issues.

hamishivi commented 1 month ago

I came back to this and got the following after running three times with the new code (AlpacaEval 1 first, AlpacaEval 2 second):

Tulu 2 13B: 78.8 +/- 0.3, 9.7 +/- 0.2
Tulu 2 13B DPO: 88.5 +/- 0.4, 13.3 +/- 0.2

And with the old code (again running three times):

Tulu 2 13B: 79.3 +/- 0.4
Tulu 2 13B DPO: 88.3 +/- 0.2

So I think the 'old' results before were maybe just a little lucky, and the old and new code seem to match in performance quite closely (within 0.5 points). It also seems that AlpacaEval has lower variance than I expected (< 1 point).
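(For clarity, the +/- numbers above are aggregated over the three reruns roughly as below; the win rates in the snippet are placeholder values, not the actual run outputs.)

```python
# Sketch: the "mean +/- x" numbers are aggregated over reruns roughly like this.
# The three win rates below are placeholder values, not the actual run outputs.
import numpy as np

win_rates = np.array([88.1, 88.5, 88.9])   # three reruns of the same eval
mean = win_rates.mean()
spread = win_rates.std(ddof=1)             # sample standard deviation across runs
print(f"{mean:.1f} +/- {spread:.1f}")      # -> 88.5 +/- 0.4
```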

Hopefully this is enough; I'm going to merge.