Hi RewardBench Team,
We have released a new generative model: Skywork/Skywork-Critic-Llama-3.1-8B.
Our local evaluation metrics for the model are listed below:
```python
# Skywork-Critic-Llama-3.1-8B
{
    'Chat': 0.9385474860335196,
    'Chat Hard': 0.8135964912280702,
    'Safety': 0.9159293787293787,
    'Reasoning': 0.8975350575653408,
}
```
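For reference, here is a minimal sketch of how these section scores combine into a single headline number, assuming the leaderboard's overall score is the unweighted mean of the four sections (the values are copied verbatim from the local run above):

```python
# Sketch: combine the four RewardBench section scores into one number,
# assuming the overall score is the unweighted mean of the sections.
scores = {
    'Chat': 0.9385474860335196,
    'Chat Hard': 0.8135964912280702,
    'Safety': 0.9159293787293787,
    'Reasoning': 0.8975350575653408,
}
overall = sum(scores.values()) / len(scores)
print(f"Overall: {overall:.4f}")  # prints Overall: 0.8914
```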
Since this is a generative model, it can be evaluated with the default scripts/run_generative.py script:
```bash
cd reward-bench
model_name_or_path="Skywork/Skywork-Critic-Llama-3.1-8B"
python scripts/run_generative.py \
    --model "$model_name_or_path" \
    --trust_remote_code \
    --do_not_save \
    --force_local \
    --num_gpus 1 \
    2>&1 | tee ./evaluation_logs.txt
```
We would like this new generative model to be added to the RewardBench leaderboard.
Thank you!