Ah, @andrewsiah, it's all one evaluation set (no training split). Yeah, I missed this line. The final dataset is here: https://huggingface.co/datasets/allenai/reward-bench That ai2-adapt-dev link was just a temporary home. Will update it now.
Hi @natolambert et al,
We are reading the paper and the 2.98K filtered dataset at huggingface.
https://huggingface.co/datasets/allenai/reward-bench
I am curious whether the huggingface 2.98K filtered data is the actual evaluation data used for the leaderboard.
Cause I looked into the code and saw this line in utils.py. When I went to ai2-adapt-dev, I saw that it is a private dataset.
Asking cause we're hoping to know if we can/should train on the huggingface dataset for our reward model to fairly compare on the leaderboard.
Thanks!