Closed YangRui2015 closed 4 months ago
Ah, great find @YangRui2015. Padding is tough with these. Lots of little things like this can make a big difference.
I couldn't agree more. Could you please rerun the model "Ray2333/GRM-Gemma-2B-sftreg"? I have verified that the other results are consistent. Thank you very much!
I noticed that the evaluation results for Ray2333/GRM-Gemma-2B-sftreg differ significantly from my local results due to the padding side issue (I use a batch size of 1, so I didn't encounter this problem).
This pull request addresses the padding issue by checking the conditions of left and right paddings to correctly project the score from the last token of each prompt-response pair.
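For readers following along, here is a minimal sketch of the idea in plain Python. It is not the code from this pull request: the function names `last_token_index` and `pick_last_token_scores` are hypothetical, and it assumes each sequence comes with a 0/1 attention mask and a list of per-token scores. Scanning the mask from the right finds the last real token whether the padding sits on the left or the right.

```python
from typing import List, Sequence


def last_token_index(attention_mask: Sequence[int]) -> int:
    """Return the position of the last non-pad token in one sequence.

    Hypothetical helper: scans from the right, so it handles both
    right padding ([1, 1, 1, 0, 0]) and left padding ([0, 0, 1, 1, 1]).
    """
    for pos in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[pos] == 1:
            return pos
    raise ValueError("sequence contains only padding")


def pick_last_token_scores(
    per_token_scores: Sequence[Sequence[float]],
    attention_masks: Sequence[Sequence[int]],
) -> List[float]:
    """Select, for each sequence in the batch, the score at its last real token."""
    return [
        scores[last_token_index(mask)]
        for scores, mask in zip(per_token_scores, attention_masks)
    ]


# Toy batch: the first row is right-padded, the second is left-padded.
scores = [
    [0.1, 0.2, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.4, 0.7],
]
masks = [
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
]
batch_scores = pick_last_token_scores(scores, masks)
```

With a batch size of 1 there is no padding, so the last position is always the last real token, which is consistent with the issue not showing up locally.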