Is the dataset on Hugging Face the eval set or a train set?

allenai / reward-bench

RewardBench: the first evaluation tool for reward models.

Apache License 2.0

Hi @natolambert et al,

We have been reading the paper and looking at the 2.98K filtered dataset on Hugging Face.

Is the 2.98K filtered dataset on Hugging Face the actual evaluation data used for the leaderboard?

I ask because I looked into the code and saw these lines in utils.py:

CORE_EVAL_SET = "ai2-adapt-dev/rm-benchmark-dev"
EXTRA_PREF_SETS = "allenai/pref-test-sets"

When I went to ai2-adapt-dev, I found that the dataset is private.

We're asking because we'd like to know whether we can or should train our reward model on the Hugging Face dataset and still compare fairly on the leaderboard.

Thanks!
