allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

stanfordnlp/SteamSHP-flan-t5 performance on SHP and HH-RLHF Helpful #81

Closed timbmg closed 3 months ago

timbmg commented 3 months ago

Hi, thanks for this great work, it's really interesting and helpful!

I was a bit surprised by the stanfordnlp/SteamSHP-flan-t5-xl and stanfordnlp/SteamSHP-flan-t5-large performance on the SHP dataset in Table 12, because their self-reported accuracies are 0.7278 and 0.7203, respectively. Do you know the reason for this difference?

(AFAIK, their reported average also includes the performance on HH-RLHF helpful-base, but I don't think that should drag the numbers down that much?)

Conversely, the HH-RLHF helpful scores in Table 12 are much lower than the ones reported on Hugging Face (0.731 vs. 0.633 and 0.731 vs. 0.629).

(screenshot attached)
natolambert commented 3 months ago

I'll look into this further, @timbmg. A few specific points below, plus some open questions.

  1. Our SHP test set is a smaller, curated subset (the full test set would be huge otherwise). From the prior preference sets dataset card, in short, we make the test set less noisy (I'm happy to see the numbers are higher, tbh); see the filtering sketch after this list:

    Stanford Human Preferences (SHP), with a subset created by taking 1 sample per prompt with a score ratio above 1.5 and a total number of Reddit votes above 10.

  2. I feel like Anthropic HH can catch a lot of people out on chat templates. We should check how their implementation formats that dataset.
  3. Given that the SHP model is a little unusual to run, there could be bugs. If you have time to check our implementation, that would be great; it is mostly copied from their code, tbf: https://github.com/allenai/reward-bench/blob/main/rewardbench/models/shp.py (a rough sketch of how the model is queried is also included below).
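
For reference on point 1, here is a minimal sketch of the kind of filtering described in the dataset card quote. It assumes the stanfordnlp/SHP columns `score_ratio`, `score_A`, `score_B`, and `post_id`, and it approximates "total Reddit votes" as `score_A + score_B`; it is not the exact script used to build the reward-bench subset.

```python
from datasets import load_dataset

# Sketch of the subset construction described above (not the reward-bench script).
# Assumed stanfordnlp/SHP columns: score_ratio, score_A, score_B, post_id.
shp = load_dataset("stanfordnlp/SHP", split="test").to_pandas()

# Keep comparisons with a clear preference margin and enough total votes
# (here "total votes" is approximated as score_A + score_B).
filtered = shp[(shp["score_ratio"] > 1.5) & ((shp["score_A"] + shp["score_B"]) > 10)]

# Keep a single comparison per prompt (Reddit post).
subset = filtered.drop_duplicates(subset="post_id")
print(len(subset))
```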
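
And for context on point 3, below is a rough sketch of how the SteamSHP models are typically queried, following the POST / RESPONSE A / RESPONSE B prompt format from their model card, with the probability assigned to the "A" token used as the preference score. This is an illustration of that pattern, not a copy of `rewardbench/models/shp.py`.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Sketch only: prompt format and A/B-token scoring as described on the
# SteamSHP model card, not the exact reward-bench implementation.
name = "stanfordnlp/SteamSHP-flan-t5-large"
tok = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name).eval()

def prob_a_preferred(post: str, resp_a: str, resp_b: str) -> float:
    """Return the model's probability that RESPONSE A is preferred over B."""
    prompt = (
        f"POST: {post}\n\n RESPONSE A: {resp_a}\n\n "
        f"RESPONSE B: {resp_b}\n\n Which response is better? RESPONSE"
    )
    inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=1,
            output_scores=True,
            return_dict_in_generate=True,
        )
    logits = out.scores[0][0]  # logits over the vocab for the first generated token
    a_id = tok("A", add_special_tokens=False).input_ids[0]
    b_id = tok("B", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[a_id, b_id]], dim=0)
    return probs[0].item()

# A pair counts as correct if prob_a_preferred(...) > 0.5 when A is the chosen response.
```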