allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

Add new reward models #202

Closed YangRui2015 closed 1 month ago

YangRui2015 commented 1 month ago

Currently, no reward model smaller than 7B has ranked in the top 30 on RewardBench. We provide new 2B and 3B reward models that are competitive with much larger reward models.

Please add the following entries to `reward-bench/rewardbench/models/__init__.py` (the other two fine-tuned models do not need this modification).

"Ray2333/GRM-Gemma2-2B-sftreg": {
    "model_builder": GRewardModel.from_pretrained,
    "pipeline_builder": GRMPipeline,
    "quantized": False,
    "custom_dialogue": False,
    "model_type": "Seq. Classifier",
},
"Ray2333/GRM-llama3.2-3B-sftreg": {
    "model_builder": GRewardModel.from_pretrained,
    "pipeline_builder": GRMPipeline,
    "quantized": False,
    "custom_dialogue": False,
    "model_type": "Seq. Classifier",
},
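For context, these entries extend `REWARD_MODEL_CONFIG`, the registry that `run_rm.py` consults to choose a model builder and pipeline, falling back to a `"default"` entry for unregistered names. A simplified, self-contained sketch of that lookup (toy dict, not the real builders):

```python
# Toy illustration of rewardbench's registry lookup: a known model name
# returns its custom entry; anything else falls back to "default".
REWARD_MODEL_CONFIG = {
    "default": {
        "quantized": True,
        "custom_dialogue": False,
        "model_type": "Seq. Classifier",
    },
    "Ray2333/GRM-Gemma2-2B-sftreg": {
        "quantized": False,
        "custom_dialogue": False,
        "model_type": "Seq. Classifier",
    },
}

def get_config(model_name: str) -> dict:
    """Return the registered config for model_name, or the default."""
    return REWARD_MODEL_CONFIG.get(model_name, REWARD_MODEL_CONFIG["default"])

print(get_config("Ray2333/GRM-Gemma2-2B-sftreg")["quantized"])  # prints False
print(get_config("some/unknown-model")["quantized"])            # prints True
```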

Training Details: Ray2333/GRM-llama3.2-3B-sftreg is trained on hendrydong/preference_700K. Ray2333/GRM-Gemma2-2B-sftreg is trained on weqweasdas/preference_dataset_mixture2_and_safe_pku. Ray2333/GRM-Llama3.2-3B-rewardmodel-ft and Ray2333/GRM-gemma2-2B-rewardmodel-ft are fine-tuned on the decontaminated dataset Skywork/Skywork-Reward-Preference-80K-v0.2.

The evaluation commands are:

CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/GRM-Gemma2-2B-sftreg --batch_size=8 --not_quantized

CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/GRM-llama3.2-3B-sftreg --batch_size=8 --not_quantized

CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/GRM-gemma2-2B-rewardmodel-ft --batch_size=8 --not_quantized

CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=Ray2333/GRM-llama3.2-3B-rewardmodel-ft --batch_size=8 --not_quantized
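The four commands differ only in the model name, so they can be generated with a loop. A dry-run sketch that prints each command (pipe the output to `bash` to actually run them; assumes you are in the repo root with a CUDA device available):

```shell
# Print the four evaluation commands, one per model (dry run).
models="Ray2333/GRM-Gemma2-2B-sftreg
Ray2333/GRM-llama3.2-3B-sftreg
Ray2333/GRM-gemma2-2B-rewardmodel-ft
Ray2333/GRM-llama3.2-3B-rewardmodel-ft"

for m in $models; do
  echo "CUDA_VISIBLE_DEVICES=0 python scripts/run_rm.py --model=$m --batch_size=8 --not_quantized"
done
```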

The local scores are:

| Model | Average | Chat | Chat Hard | Safety | Reasoning |
| --- | --- | --- | --- | --- | --- |
| Ray2333/GRM-Llama3.2-3B-rewardmodel-ft (3B) | 90.9 | 91.6 | 84.9 | 92.7 | 94.6 |
| Ray2333/GRM-gemma2-2B-rewardmodel-ft (2B) | 88.4 | 93.0 | 77.2 | 92.2 | 91.2 |
| Ray2333/GRM-llama3.2-3B-sftreg (3B) | 85.8 | 96.4 | 67.1 | 88.2 | 91.6 |
| Ray2333/GRM-Gemma2-2B-sftreg (2B) | 81.0 | 97.2 | 59.6 | 86.9 | 80.3 |
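Assuming the Average column is the unweighted mean of the four section scores (as on the RewardBench leaderboard), the reported numbers can be sanity-checked directly; a 0.1 tolerance absorbs rounding:

```python
# Verify each reported Average against the mean of the four
# section scores (Chat, Chat Hard, Safety, Reasoning).
scores = {
    "Ray2333/GRM-Llama3.2-3B-rewardmodel-ft": (90.9, [91.6, 84.9, 92.7, 94.6]),
    "Ray2333/GRM-gemma2-2B-rewardmodel-ft":   (88.4, [93.0, 77.2, 92.2, 91.2]),
    "Ray2333/GRM-llama3.2-3B-sftreg":         (85.8, [96.4, 67.1, 88.2, 91.6]),
    "Ray2333/GRM-Gemma2-2B-sftreg":           (81.0, [97.2, 59.6, 86.9, 80.3]),
}

for name, (reported_avg, sections) in scores.items():
    mean = sum(sections) / len(sections)
    assert abs(mean - reported_avg) <= 0.1, (name, mean, reported_avg)
    print(f"{name}: reported {reported_avg}, computed {mean:.2f}")
```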