allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

Add new reward models #169

Closed chrisliu298 closed 2 months ago

chrisliu298 commented 2 months ago

We would like to add two new reward models, Skywork-Reward-Llama-3.1-8B and Skywork-Reward-Gemma-2-27B, to the RewardBench leaderboard. Both models are sequence classifiers.

Our local evaluation results for both models are as follows:

| Model | Chat | Chat Hard | Safety | Reasoning |
| --- | --- | --- | --- | --- |
| Skywork-Reward-Llama-3.1-8B | 0.9609 | 0.8728 | 0.9058 | 0.9615 |
| Skywork-Reward-Gemma-2-27B | 0.9581 | 0.9145 | 0.9200 | 0.9616 |

When we ran the evaluation script, we manually added the attn_implementation argument to model_kwargs because we observed issues with gemma-2-27b-it under the default SDPA attention (which causes severe performance degradation, as mentioned here). We therefore recommend this setup when verifying the performance of our models.

Both models are sequence classifiers that utilize Hugging Face's native implementation, so no additional code is required to implement the pipeline.
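For reference, here is a minimal sketch of loading and scoring with transformers directly, reflecting the flash_attention_2 recommendation above; num_labels, device_map, and the example conversation are illustrative assumptions, not the exact run_rm.py code:

```python
# Minimal sketch: load one of the models as a standard sequence classifier and
# read the single logit off as the reward score. flash_attention_2 requires the
# flash-attn package; the default SDPA kernels degrade gemma-2 performance.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Skywork/Skywork-Reward-Gemma-2-27B"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # see the note above
    num_labels=1,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
inputs = tokenizer.apply_chat_template(
    conversation, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    reward = model(**inputs).logits[0][0].item()  # higher = preferred
```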

Thank you!

natolambert commented 2 months ago

Thanks @chrisliu298. Do you want to open a PR that passes this kwarg in the right place in run_rm? Likely an if statement, unfortunately.

Let me know, otherwise I'll get to it soon.

chrisliu298 commented 2 months ago

Sounds good. I've submitted a PR at #170 and have tested it with the following commands:

python scripts/run_rm.py --model Skywork/Skywork-Reward-Llama-3.1-8B --batch_size 1 --torch_dtype bfloat16 --do_not_save --attn_implementation flash_attention_2
python scripts/run_rm.py --model Skywork/Skywork-Reward-Gemma-2-27B --batch_size 1 --torch_dtype bfloat16 --do_not_save --attn_implementation flash_attention_2
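The plumbing amounts to something like the following sketch (variable names here are assumptions, not the exact #170 diff):

```python
# Rough sketch of threading an --attn_implementation CLI flag into the kwargs
# used to load the reward model; the kwarg is only set when the user asks for it.
import argparse

import torch

parser = argparse.ArgumentParser()
parser.add_argument(
    "--attn_implementation",
    type=str,
    default=None,
    help="e.g. 'flash_attention_2' to work around SDPA issues on gemma-2 models",
)
args = parser.parse_args()

model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
if args.attn_implementation is not None:  # the "if statement" mentioned above
    model_kwargs["attn_implementation"] = args.attn_implementation
```
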
natolambert commented 2 months ago

Also @chrisliu298, RE:

We removed the BOS token from the chat templates of the two models to prevent it from being added twice during apply_chat_template and tokenization.

I added a somewhat hacky check in the pipeline that turns off the attention mask on the repeated BOS token: https://github.com/allenai/reward-bench/pull/166
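Roughly, the check behaves like this sketch (a hypothetical helper, not the actual reward-bench code):

```python
# Hypothetical sketch of masking a duplicated BOS token: when the chat template
# already starts with BOS and the tokenizer prepends another, zero the attention
# mask on one of them so the model effectively sees a single BOS.
import torch

def mask_repeated_bos(input_ids: torch.Tensor, attention_mask: torch.Tensor,
                      bos_token_id: int) -> torch.Tensor:
    """input_ids / attention_mask have shape (batch, seq_len)."""
    repeated = (input_ids[:, 0] == bos_token_id) & (input_ids[:, 1] == bos_token_id)
    attention_mask = attention_mask.clone()
    attention_mask[repeated, 0] = 0  # turn off attention on the extra BOS
    return attention_mask
```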

chrisliu298 commented 2 months ago

Ah, that's great. We added the note mainly to let people know about this for their own usage, since we did not initially realize the problem ourselves. I'll add another note to the README later to clarify this further.

Also, let me know if you think it's better, or the norm in general, to include the BOS token in the template anyway. I assume this does not affect the RewardBench evaluation.

natolambert commented 2 months ago

At this point, reward model norms are still quite flexible. I think having it in the chat template is okay; it's a design flaw in how RewardBench was built that we handle tokenization in the pipeline rather than in the data curation phase, though that does make it easier to debug.

chrisliu298 commented 2 months ago

That makes sense. We'll leave it as is for now.

We would greatly appreciate it if you could add our models to the leaderboard once the verification is complete.

natolambert commented 2 months ago

Already ran one! I may need to create a new Docker image with flash attention for the gemma model, but it should be done soon. Then I'll restart the leaderboard.

natolambert commented 2 months ago

Models are live! Congrats!

chrisliu298 commented 2 months ago

Thank you for conducting the evaluation for us!