allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

fix padding for GRM class #154

Closed YangRui2015 closed 2 weeks ago

YangRui2015 commented 2 weeks ago

I noticed that the evaluation results for Ray2333/GRM-Gemma-2B-sftreg differ significantly from my local results. The cause is a padding-side issue: locally I use a batch size of 1, so no padding is applied and I never encountered the problem.

This pull request fixes the padding issue by handling both left and right padding, so that the score is always projected from the last non-padding token of each prompt-response pair.
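For readers following along, here is a minimal sketch of the idea, not the code from this pull request. It assumes the reward model yields one score per token position and that the attention mask marks real tokens with 1 and padding with 0. The function names `last_token_index` and `select_last_token_scores` are hypothetical, and plain Python lists stand in for tensors so the example is self-contained.

```python
from typing import List, Sequence


def last_token_index(attention_mask: Sequence[int]) -> int:
    """Return the position of the last real (non-padding) token in one row.

    Scanning from the right works for both padding sides:
    - right padding: skips the trailing zeros,
    - left padding: the final position is already a real token.
    """
    for pos in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[pos] == 1:
            return pos
    raise ValueError("row contains only padding")


def select_last_token_scores(
    per_token_scores: Sequence[Sequence[float]],
    attention_mask: Sequence[Sequence[int]],
) -> List[float]:
    """Pick, for each sequence in the batch, the score at its last real token."""
    return [
        row[last_token_index(mask)]
        for row, mask in zip(per_token_scores, attention_mask)
    ]


# Hypothetical batch: row 0 is right-padded, row 1 is left-padded.
scores = [
    [0.1, 0.2, 0.3, 9.0, 9.0],  # 9.0 marks scores at padding positions
    [9.0, 9.0, 0.4, 0.5, 0.6],
]
mask = [
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
]
picked = select_last_token_scores(scores, mask)
```

A fixed rule only works for one padding side: always reading position `-1` is correct with left padding, while `sum(mask) - 1` is correct with right padding. With a batch size of 1 there is no padding, so both rules agree, which is consistent with the mismatch only appearing in batched evaluation.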

natolambert commented 2 weeks ago

Ah, great find @YangRui2015. Padding is tough with these. Lots of little things like this can make a big difference.

YangRui2015 commented 2 weeks ago

I couldn't agree more. Could you please rerun the model "Ray2333/GRM-Gemma-2B-sftreg"? I have verified that the other results are consistent. Thank you very much!