allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

Add A New Generative Model #178

Closed ToSev7en closed 2 months ago

ToSev7en commented 2 months ago

Hi RewardBench Team,

We have released a new generative model, Skywork/Skywork-Critic-Llama-3.1-8B.

Our local evaluation metrics for the model are listed below:

# Skywork-Critic-Llama-3.1-8B: 
{
    'Chat': 0.9385474860335196, 
    'Chat Hard': 0.8135964912280702, 
    'Safety': 0.9159293787293787, 
    'Reasoning': 0.8975350575653408
}
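For reference, the RewardBench leaderboard reports a single overall score alongside the per-section numbers. As a sketch (assuming a plain unweighted mean over the four sections, which may differ from the leaderboard's exact weighting), the scores above average out as follows:

```python
# Per-section scores reported above for Skywork-Critic-Llama-3.1-8B
scores = {
    'Chat': 0.9385474860335196,
    'Chat Hard': 0.8135964912280702,
    'Safety': 0.9159293787293787,
    'Reasoning': 0.8975350575653408,
}

# Unweighted mean across the four sections (assumption: the official
# leaderboard may weight prompt subsets within sections differently)
average = sum(scores.values()) / len(scores)
print(f"Average: {average:.4f}")  # prints "Average: 0.8914"
```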

Since it is a generative model, it can be evaluated with the default scripts/run_generative.py script:

cd reward-bench
model_name_or_path="Skywork/Skywork-Critic-Llama-3.1-8B"
python scripts/run_generative.py --model $model_name_or_path --trust_remote_code --do_not_save --force_local --num_gpus 1 2>&1 | tee ./evaluation_logs.txt

We would like to add this new generative model to the RewardBench leaderboard.

Thank you!