The Skywork preference dataset demonstrates that a small, high-quality dataset can yield powerful reward models. By fine-tuning Ray2333/GRM-Gemma-2B-sftreg on this dataset, we obtain a state-of-the-art 2B reward model that even surpasses GPT-4o (openai/gpt-4o-2024-05-13) as a judge in average score (84.7 vs. 84.6).
Evaluation
We evaluate GRM-Gemma-2B-rewardmodel-ft on RewardBench, where it achieves state-of-the-art performance among models smaller than 6B.
| Model | Average | Chat | Chat Hard | Safety | Reasoning |
|---|---|---|---|---|---|
| Ray2333/GRM-Gemma-2B-rewardmodel-ft (Ours, 2B) | 84.7 | 89.4 | 75.2 | 85.5 | 88.8 |
| openai/gpt-4o-2024-05-13 | 84.6 | 96.6 | 70.4 | 86.5 | 84.9 |
| sfairXC/FsfairX-LLaMA3-RM-v0.1 (8B) | 84.4 | 99.4 | 65.1 | 86.8 | 86.4 |
| Nexusflow/Starling-RM-34B (34B) | 82.6 | 96.9 | 57.2 | 87.7 | 88.5 |
| Ray2333/Gemma-2B-rewardmodel-ft (Ours, 2B) | 80.5 | 77.9 | 74.8 | 85.2 | 84.0 |
| Ray2333/GRM-Gemma-2B-sftreg (2B) | 75.3 | 95.5 | 48.7 | 80.0 | 76.8 |
| berkeley-nest/Starling-RM-7B-alpha (7B) | 74.6 | 98 | 43.4 | 88.6 | 74.6 |
| Ray2333/Gemma-2B-rewardmodel-baseline (2B) | 73.7 | 94.1 | 46.1 | 79.6 | 75.0 |
| stabilityai/stablelm-zephyr-3b (3B) | 73.1 | 86.3 | 60.1 | 70.3 | 75.7 |
We provide two new 2B reward models and an 8B reward model.
Introduction
This reward model is fine-tuned from Ray2333/GRM-Gemma-2B-sftreg using the Skywork preference dataset.
Run
When evaluating with RewardBench, add '--not_quantized' to avoid a performance drop.
CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/GRM-Gemma-2B-rewardmodel-ft --chat_template=gemma --batch_size=${batch_size} --not_quantized
CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/Gemma-2B-rewardmodel-ft --chat_template=gemma --batch_size=${batch_size} --not_quantized
CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/GRM-Llama3-8B-rewardmodel-ft --chat_template=llama-3 --batch_size=${batch_size} --not_quantized
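The three commands above leave ${batch_size} unset, so one way to run them in sequence is a small wrapper script. The following is only a sketch under stated assumptions: it is meant to be run from a RewardBench checkout where scripts/run_rm.py exists, the default batch size of 8 is an arbitrary placeholder, and the names build_cmd, run_one, and DRY_RUN are hypothetical helpers introduced here, not part of RewardBench.

```shell
#!/bin/sh
# Sketch: run the three evaluations in sequence with a shared batch size.
# Assumption: executed from a RewardBench checkout containing scripts/run_rm.py.
set -eu

# Placeholder default; pick a value that fits your GPU memory.
batch_size="${batch_size:-8}"
# Hypothetical switch: 1 = only print each command, 0 = execute it.
DRY_RUN="${DRY_RUN:-1}"

# Build the evaluation command for one model.
# $1 = model name, $2 = chat template
build_cmd() {
  printf 'python scripts/run_rm.py --model=%s --chat_template=%s --batch_size=%s --not_quantized\n' \
    "$1" "$2" "$batch_size"
}

# Print or execute the command for one model on GPU 0.
run_one() {
  cmd="$(build_cmd "$1" "$2")"
  if [ "$DRY_RUN" = "1" ]; then
    echo "CUDA_VISIBLE_DEVICES=0 $cmd"
  else
    # $cmd is intentionally unquoted so it splits into command and arguments.
    CUDA_VISIBLE_DEVICES=0 $cmd
  fi
}

run_one Ray2333/GRM-Gemma-2B-rewardmodel-ft gemma
run_one Ray2333/Gemma-2B-rewardmodel-ft gemma
run_one Ray2333/GRM-Llama3-8B-rewardmodel-ft llama-3
```

With the default DRY_RUN=1 the script only prints the three commands, which makes it easy to check them before starting a real run with DRY_RUN=0.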