allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

stanfordnlp/SteamSHP-flan-t5 performance on SHP and HH-RLHF Helpful #81

Closed timbmg closed 3 months ago

timbmg commented 3 months ago

Hi, thanks for this great work, it's really interesting and helpful!

I was a bit surprised by the stanfordnlp/SteamSHP-flan-t5-xl and stanfordnlp/SteamSHP-flan-t5-large performance on the SHP dataset in Table 12, because their self-reported accuracies are 0.7278 and 0.7203, respectively. Do you know the reason for this difference?

(AFAIK, their reported average also includes the performance on HH-RLHF helpful-base, but I don't think that should drag the numbers down that much?)

Conversely, the HH-RLHF helpful scores in Table 12 are much lower than the ones reported on Hugging Face (0.731 vs. 0.633 and 0.731 vs. 0.629).

(screenshot attached)
natolambert commented 3 months ago

I'll look into this further, @timbmg. A few specific points below, plus some open questions.

  1. Our SHP test set is a smaller, curated subset (the full test set would be huge otherwise). From the prior preference sets dataset card, in short, we make the test set less noisy (I'm happy to see the numbers are higher, tbh); see the filtering sketch after this list:

    Stanford Human Preferences (SHP), with a subset created by taking 1 sample per prompt with a score ratio above 1.5 and a total number of Reddit votes above 10.

  2. I feel like Anthropic HH can catch a lot of people out on chat templates. We should check how their implementation formats that dataset.
  3. Given that the SHP model is a little unusual to run, there could be bugs. If you have time to check our implementation, that would be great; it is mostly copied from their code, tbf: https://github.com/allenai/reward-bench/blob/main/rewardbench/models/shp.py (a rough sketch of how the model is queried is also included below).
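
For reference on point 1, here is a minimal sketch of the kind of filtering described in the dataset card quote. It assumes the stanfordnlp/SHP columns `score_ratio`, `score_A`, `score_B`, and `post_id`, and it approximates "total Reddit votes" as `score_A + score_B`; it is not the exact script used to build the reward-bench subset.

```python
from datasets import load_dataset

# Sketch of the subset construction described above (not the reward-bench script).
# Assumed stanfordnlp/SHP columns: score_ratio, score_A, score_B, post_id.
shp = load_dataset("stanfordnlp/SHP", split="test").to_pandas()

# Keep comparisons with a clear preference margin and enough total votes
# (here "total votes" is approximated as score_A + score_B).
filtered = shp[(shp["score_ratio"] > 1.5) & ((shp["score_A"] + shp["score_B"]) > 10)]

# Keep a single comparison per prompt (Reddit post).
subset = filtered.drop_duplicates(subset="post_id")
print(len(subset))
```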
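
And for context on point 3, below is a rough sketch of how the SteamSHP models are typically queried, following the POST / RESPONSE A / RESPONSE B prompt format from their model card, with the probability assigned to the "A" token used as the preference score. This is an illustration of that pattern, not a copy of `rewardbench/models/shp.py`.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Sketch only: prompt format and A/B-token scoring as described on the
# SteamSHP model card, not the exact reward-bench implementation.
name = "stanfordnlp/SteamSHP-flan-t5-large"
tok = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name).eval()

def prob_a_preferred(post: str, resp_a: str, resp_b: str) -> float:
    """Return the model's probability that RESPONSE A is preferred over B."""
    prompt = (
        f"POST: {post}\n\n RESPONSE A: {resp_a}\n\n "
        f"RESPONSE B: {resp_b}\n\n Which response is better? RESPONSE"
    )
    inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=1,
            output_scores=True,
            return_dict_in_generate=True,
        )
    logits = out.scores[0][0]  # logits over the vocab for the first generated token
    a_id = tok("A", add_special_tokens=False).input_ids[0]
    b_id = tok("B", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[a_id, b_id]], dim=0)
    return probs[0].item()

# A pair counts as correct if prob_a_preferred(...) > 0.5 when A is the chosen response.
```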