How to Run Evaluation Script
For this generative model, it can be evaluated with the default scripts/run_generative.py script. Please note that at least 8 GPUs are required to run run_generative.py, and export VLLM_WORKER_MULTIPROC_METHOD=spawn is required for vLLM multi-GPU inference.
```shell
export VLLM_WORKER_MULTIPROC_METHOD=spawn
cd reward-bench
python scripts/run_generative.py --model=SF-Foundation/TextEval-OffsetBias-12B --num_gpus 8
```
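If you launch the evaluation from Python rather than a shell (e.g. from a wrapper script or notebook), the same environment variable can be set programmatically before vLLM spins up its worker processes. This is a minimal sketch, not part of the official script:

```python
import os

# vLLM's multi-GPU backend requires the "spawn" multiprocessing start
# method; it must be set before vLLM creates any worker processes.
# Equivalent to `export VLLM_WORKER_MULTIPROC_METHOD=spawn` in the shell,
# but scoped to this Python process and its children.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
```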
We would like to add this new generative model to the RewardBench Leaderboard.
Hi RewardBench Team 👋,
We have uploaded a new 12B generative model:
SF-Foundation/TextEval-OffsetBias-12B. Our local evaluation metrics for the model are listed below:
```json
{
  "Chat": 0.9217877094972067,
  "Chat Hard": 0.868421052631579,
  "Safety": 0.9238221130221129,
  "Reasoning": 0.937493179461996
}
```
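For reference, an unweighted mean of the four section scores can be computed as in the sketch below. This is only an approximation: the leaderboard may aggregate categories with different per-prompt weighting.

```python
# Local RewardBench section scores reported above.
scores = {
    "Chat": 0.9217877094972067,
    "Chat Hard": 0.868421052631579,
    "Safety": 0.9238221130221129,
    "Reasoning": 0.937493179461996,
}

# Unweighted mean across the four sections (assumption: the official
# leaderboard score may weight subsets differently, so treat this as
# an approximate headline number).
overall = sum(scores.values()) / len(scores)
print(round(overall, 4))  # ≈ 0.9129
```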
Thank you!