allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

fix padding for GRM class #154

Closed YangRui2015 closed 4 months ago

YangRui2015 commented 4 months ago

I noticed that the evaluation results for Ray2333/GRM-Gemma-2B-sftreg differ significantly from my local results. The cause is a padding-side issue, which I did not encounter locally because I use a batch size of 1.

This pull request fixes the issue by handling left and right padding separately, so that the score is correctly projected from the last token of each prompt-response pair.
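For readers unfamiliar with the problem, here is a minimal sketch of the idea. It is not the code from this pull request. It assumes a Hugging Face style attention mask (1 for real tokens, 0 for padding) and per-token scores stored as plain lists; the function names `last_token_indices` and `select_last_token_scores` are hypothetical.

```python
from typing import List, Sequence


def last_token_indices(attention_mask: Sequence[Sequence[int]]) -> List[int]:
    """Return, for each row, the index of the last non-padding token.

    Assumes 1 marks a real token and 0 marks padding. Scanning from the
    end of the row gives the right answer for both padding sides.
    """
    indices = []
    for row in attention_mask:
        last = -1
        # Walk backwards until the first real token is found.
        for i in range(len(row) - 1, -1, -1):
            if row[i] == 1:
                last = i
                break
        if last < 0:
            raise ValueError("row contains no real tokens")
        indices.append(last)
    return indices


def select_last_token_scores(
    token_scores: Sequence[Sequence[float]],
    attention_mask: Sequence[Sequence[int]],
) -> List[float]:
    """Pick one score per sequence, taken at its last real token."""
    idx = last_token_indices(attention_mask)
    return [row[i] for row, i in zip(token_scores, idx)]


# Hypothetical batch of two sequences, each with three real tokens.
right_padded_mask = [[1, 1, 1, 0, 0]]
left_padded_mask = [[0, 0, 1, 1, 1]]
scores = [[0.1, 0.2, 0.9, 0.0, 0.0]]

# Right padding: the last real token is at index 2, not at the final position.
right_idx = last_token_indices(right_padded_mask)
# Left padding: the last real token is at the final position, index 4.
left_idx = last_token_indices(left_padded_mask)
picked = select_last_token_scores(scores, right_padded_mask)
```

With a batch size of 1 there is usually no padding at all, so the final position and the last real token coincide, which is consistent with the discrepancy not showing up locally.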

natolambert commented 4 months ago

Ah, great find @YangRui2015. Padding is tough with these. Lots of little things like this can make a big difference.

YangRui2015 commented 4 months ago

I couldn't agree more. Could you please rerun the model "Ray2333/GRM-Gemma-2B-sftreg"? I have verified that the other results are consistent. Thank you very much!