allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

Add A New Generative Model #182

Closed ToSev7en closed 1 week ago

ToSev7en commented 1 week ago

Hi RewardBench Team 👋,

We have updated our generative model with a 70B version:

Our local evaluation metrics for the model are listed below:

{
    "Chat": 0.9692737430167597, 
    "Chat Hard": 0.8837719298245614, 
    "Safety": 0.9324324324324325, 
    "Reasoning": 0.9542546516069188
}
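
For easier comparison with leaderboard-style numbers, the section scores above can be converted to percentages and averaged. This is only a minimal sketch: the unweighted mean is an assumption made for illustration, not necessarily how the leaderboard computes its overall score, and the `summarize` helper is hypothetical.

```python
from statistics import mean

# Section scores copied verbatim from the metrics reported above.
section_scores = {
    "Chat": 0.9692737430167597,
    "Chat Hard": 0.8837719298245614,
    "Safety": 0.9324324324324325,
    "Reasoning": 0.9542546516069188,
}


def summarize(scores: dict[str, float]) -> dict[str, float]:
    """Convert each score to a percentage (1 decimal) and add an unweighted mean.

    The unweighted mean is an assumption for illustration only; the
    leaderboard may weight sections differently.
    """
    summary = {name: round(100 * value, 1) for name, value in scores.items()}
    summary["Mean (unweighted)"] = round(100 * mean(scores.values()), 1)
    return summary


if __name__ == "__main__":
    for name, pct in summarize(section_scores).items():
        print(f"{name:<18} {pct:5.1f}")
```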

Our hardware and environments:

NVIDIA-SMI 470.199.02 Driver Version: 470.199.02 CUDA Version: 12.3
NVIDIA A800-SXM4 80G * 4

How to Run Evaluation Script

This generative model can be evaluated with the default scripts/run_generative.py script. Please note that running run_generative.py for this model requires at least 4 GPUs, and export VLLM_WORKER_MULTIPROC_METHOD=spawn must be set for vLLM multi-GPU inference.

export VLLM_WORKER_MULTIPROC_METHOD=spawn
cd reward-bench
model_name_or_path="Skywork/Skywork-Critic-Llama-3.1-70B"
python scripts/run_generative.py --model $model_name_or_path --trust_remote_code --do_not_save --force_local --num_gpus 4 2>&1 | tee ./evaluation_logs.txt
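
The same invocation could also be assembled from Python, which makes the two requirements (at least 4 GPUs, and the spawn setting for vLLM) explicit before anything is launched. This is a rough sketch, not part of the repository: `build_eval_command` and `MIN_GPUS` are hypothetical names, and the flags are simply the ones used in the command above.

```python
import os
import shlex

# Minimum GPU count stated in this issue for the 70B model (not a library constant).
MIN_GPUS = 4


def build_eval_command(model: str, num_gpus: int = 4) -> tuple[list[str], dict[str, str]]:
    """Return (argv, env) mirroring the shell command in this issue."""
    if num_gpus < MIN_GPUS:
        raise ValueError(f"need at least {MIN_GPUS} GPUs, got {num_gpus}")
    # Copy the current environment and add the setting required for
    # vLLM multi-GPU inference.
    env = dict(os.environ, VLLM_WORKER_MULTIPROC_METHOD="spawn")
    cmd = [
        "python", "scripts/run_generative.py",
        "--model", model,
        "--trust_remote_code",
        "--do_not_save",
        "--force_local",
        "--num_gpus", str(num_gpus),
    ]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_eval_command("Skywork/Skywork-Critic-Llama-3.1-70B")
    print(shlex.join(cmd))
    # To launch from inside the reward-bench checkout, one could then call:
    #   subprocess.run(cmd, env=env, check=True)
```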

We would like to add this new generative model to the RewardBench leaderboard.

Thank you!

natolambert commented 1 week ago

@ToSev7en I ran this, thx! It'll be there upon leaderboard restart.