allenai/reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

Request to Evaluate Lenovo-Reward-Gemma-2-27B-v1.0 on RewardBench #206

Open · ShikaiChen opened 1 week ago

ShikaiChen commented 1 week ago

Hi RewardBench Team,

Could you please help us evaluate our model, https://huggingface.co/lenovo/Lenovo-Reward-Gemma-2-27B-v1.0, on RewardBench? You can use the following command for testing:

```
python ./scripts/run_rm.py --model lenovo/Lenovo-Reward-Gemma-2-27B-v1.0 --batch_size 1 --do_not_save --trust_remote_code --torch_dtype bfloat16 --attn_implementation flash_attention_2
```
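For a quick sanity check before the full run, here is a minimal sketch of what a single scoring call looks like. This is not `run_rm.py`'s exact code path; the scalar classification head, the chat template, and the example conversation are assumptions to verify against the model card:

```python
# Hypothetical single-example scoring sketch. Assumes the model exposes a
# scalar sequence-classification head and ships a chat template (check the
# model card). The flash_attention_2 path requires flash-attn installed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "lenovo/Lenovo-Reward-Gemma-2-27B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map="auto",
)

# Example conversation (hypothetical content); the reward model scores the
# assistant turn, and a higher score means the response is preferred.
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
with torch.no_grad():
    score = model(input_ids).logits[0].item()
print(score)
```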

We expect the results to be as follows:

```
alpacaeval-easy: 99/100 (0.99)
alpacaeval-hard: 90/95 (0.9473684210526315)
alpacaeval-length: 90/95 (0.9473684210526315)
donotanswer: 105/136 (0.7720588235294118)
hep-cpp: 161/164 (0.9817073170731707)
hep-go: 160/164 (0.975609756097561)
hep-java: 163/164 (0.9939024390243902)
hep-js: 163/164 (0.9939024390243902)
hep-python: 160/164 (0.975609756097561)
hep-rust: 156/164 (0.9512195121951219)
llmbar-adver-GPTInst: 89/92 (0.967391304347826)
llmbar-adver-GPTOut: 41/47 (0.8723404255319149)
llmbar-adver-manual: 39/46 (0.8478260869565217)
llmbar-adver-neighbor: 121/134 (0.9029850746268657)
llmbar-natural: 95/100 (0.95)
math-prm: 446/447 (0.9977628635346756)
mt-bench-easy: 28/28 (1.0)
mt-bench-hard: 33/37 (0.8918918918918919)
mt-bench-med: 39/40 (0.975)
refusals-dangerous: 96/100 (0.96)
refusals-offensive: 99/100 (0.99)
xstest-should-refuse: 147/154 (0.9545454545454546)
xstest-should-respond: 239/250 (0.956)
{'Chat': 0.9664804469273743, 'Chat Hard': 0.9166666666666666, 'Safety': 0.927027027027027, 'Reasoning': 0.9882107000600208}
```
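For anyone cross-checking, the four section scores are consistent with the per-subset counts above under RewardBench's leaderboard grouping: Chat, Chat Hard, and Safety pool correct/total across their subsets (example-weighted), while Reasoning is the unweighted mean of math-prm accuracy and the average accuracy of the six hep-* code subsets. A sketch that reproduces the summary dict from the counts:

```python
# Cross-check sketch: recompute the section scores from the
# (correct, total) counts reported above.
counts = {
    "alpacaeval-easy": (99, 100), "alpacaeval-hard": (90, 95),
    "alpacaeval-length": (90, 95), "mt-bench-easy": (28, 28),
    "mt-bench-med": (39, 40),
    "llmbar-adver-GPTInst": (89, 92), "llmbar-adver-GPTOut": (41, 47),
    "llmbar-adver-manual": (39, 46), "llmbar-adver-neighbor": (121, 134),
    "llmbar-natural": (95, 100), "mt-bench-hard": (33, 37),
    "donotanswer": (105, 136), "refusals-dangerous": (96, 100),
    "refusals-offensive": (99, 100), "xstest-should-refuse": (147, 154),
    "xstest-should-respond": (239, 250),
    "math-prm": (446, 447),
    "hep-cpp": (161, 164), "hep-go": (160, 164), "hep-java": (163, 164),
    "hep-js": (163, 164), "hep-python": (160, 164), "hep-rust": (156, 164),
}

def weighted(subsets):
    # Pool correct over total across subsets (per-example weighting).
    return sum(counts[s][0] for s in subsets) / sum(counts[s][1] for s in subsets)

hep = [s for s in counts if s.startswith("hep-")]
code_acc = sum(counts[s][0] / counts[s][1] for s in hep) / len(hep)
scores = {
    "Chat": weighted(["alpacaeval-easy", "alpacaeval-hard",
                      "alpacaeval-length", "mt-bench-easy", "mt-bench-med"]),
    "Chat Hard": weighted(["llmbar-adver-GPTInst", "llmbar-adver-GPTOut",
                           "llmbar-adver-manual", "llmbar-adver-neighbor",
                           "llmbar-natural", "mt-bench-hard"]),
    "Safety": weighted(["donotanswer", "refusals-dangerous",
                        "refusals-offensive", "xstest-should-refuse",
                        "xstest-should-respond"]),
    "Reasoning": (counts["math-prm"][0] / counts["math-prm"][1] + code_acc) / 2,
}
print(scores)  # matches the summary dict above
```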