Currently, no reward model smaller than 7B ranks in the top 30 on RewardBench. We provide new 2B and 3B reward models whose performance is comparable to that of much larger reward models.
Please add the following code to reward-bench/rewardbench/models/__init__.py (the other two fine-tuned models do not need this modification).
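For reference, the new entries follow the existing `REWARD_MODEL_CONFIG` pattern in that file. The sketch below is illustrative only: the `GRewardModel` class, `GRMPipeline` wrapper, and the `rewardbench.models.grm` module path are assumptions standing in for the custom reward-head code shipped with this change (the sft-reg models use an extra value head, so they cannot be loaded with the default sequence-classification builder).

```python
# Sketch of the REWARD_MODEL_CONFIG entries for the two sft-reg models.
# GRewardModel / GRMPipeline and the module path below are assumed names,
# mirroring how other custom reward models are registered in this file.
from rewardbench.models.grm import GRewardModel, GRMPipeline  # assumed module path

GRM_CONFIG = {
    "model_builder": GRewardModel.from_pretrained,  # custom class with a value head
    "pipeline_builder": GRMPipeline,
    "quantized": True,
    "custom_dialogue": False,
    "model_type": "Seq. Classifier",
}

REWARD_MODEL_CONFIG.update(
    {
        "Ray2333/GRM-Gemma2-2B-sftreg": GRM_CONFIG,
        "Ray2333/GRM-llama3.2-3B-sftreg": GRM_CONFIG,
    }
)
```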
Training Details:
- Ray2333/GRM-llama3.2-3B-sftreg is trained on hendrydong/preference_700K.
- Ray2333/GRM-Gemma2-2B-sftreg is trained on weqweasdas/preference_dataset_mixture2_and_safe_pku.
- Ray2333/GRM-Llama3.2-3B-rewardmodel-ft and Ray2333/GRM-gemma2-2B-rewardmodel-ft are fine-tuned on the decontaminated dataset Skywork/Skywork-Reward-Preference-80K-v0.2.
The evaluation commands are:
```bash
CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/GRM-Gemma2-2B-sftreg --batch_size=8 --not_quantized
CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/GRM-llama3.2-3B-sftreg --batch_size=8 --not_quantized
CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/GRM-gemma2-2B-rewardmodel-ft --batch_size=8 --not_quantized
CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/GRM-llama3.2-3B-rewardmodel-ft --batch_size=8 --not_quantized
```
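Outside of RewardBench, the two -rewardmodel-ft checkpoints use a standard sequence-classification head, so they can be scored with plain transformers; no custom class is required. A minimal sketch (the example messages are placeholders):

```python
# Score a single (prompt, response) pair with one of the -ft reward models.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Ray2333/GRM-gemma2-2B-rewardmodel-ft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", num_labels=1
)

messages = [
    {"role": "user", "content": "What is the capital of France?"},  # placeholder
    {"role": "assistant", "content": "The capital of France is Paris."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
with torch.no_grad():
    score = model(input_ids).logits[0].item()  # scalar reward: higher = preferred
print(score)
```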
The local scores are: