Thanks @chrisliu298. Do you want to open a PR that passes this kwarg in the right place in run_rm? Likely an if statement, unfortunately.
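Something roughly like this (a hypothetical sketch; the flag name and the surrounding run_rm.py structure are assumptions):

```python
# Sketch: forward an optional --attn_implementation flag into model_kwargs.
model_kwargs = {"torch_dtype": torch_dtype}
if args.attn_implementation is not None:
    # Only pass the kwarg when explicitly requested, since not every model
    # or environment supports flash attention.
    model_kwargs["attn_implementation"] = args.attn_implementation
```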
Let me know, otherwise I'll get to it soon.
Sounds good. I've submitted a PR at #170 and have tested it with the following commands:
```bash
python scripts/run_rm.py --model Skywork/Skywork-Reward-Llama-3.1-8B --batch_size 1 --torch_dtype bfloat16 --do_not_save --attn_implementation flash_attention_2
python scripts/run_rm.py --model Skywork/Skywork-Reward-Gemma-2-27B --batch_size 1 --torch_dtype bfloat16 --do_not_save --attn_implementation flash_attention_2
```
Also @chrisliu298, RE:
> We removed the BOS token from the chat templates of the two models to prevent it from being added twice during apply_chat_template and tokenization.
I added a somewhat hacky check in the pipeline that turns off the attention mask on the repeated BOS token. https://github.com/allenai/reward-bench/pull/166
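For illustration, that kind of check could look something like this (a hypothetical sketch, not the exact code in #166):

```python
import torch

def mask_repeated_bos(
    input_ids: torch.Tensor, attention_mask: torch.Tensor, bos_token_id: int
) -> torch.Tensor:
    """If every sequence in the batch starts with two BOS tokens (one baked
    into the chat template, one added again by the tokenizer), zero out the
    attention mask on the duplicate so the model sees a single BOS."""
    mask = attention_mask.clone()
    if (
        input_ids.shape[1] >= 2
        and bool((input_ids[:, 0] == bos_token_id).all())
        and bool((input_ids[:, 1] == bos_token_id).all())
    ):
        mask[:, 1] = 0
    return mask
```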
Ah, that's great. We added the note mainly to make people aware of this in their own usage, since we did not initially realize the problem ourselves. I'll add another note to the README later to clarify this further.
Also, let me know if you think it's better, or if it's the norm in general, to include the BOS token in the template anyway. I assume this does not affect the RewardBench evaluation.
At this point reward model norms are still really flexible. I think having it in the chat template is okay; it's a design flaw in how RewardBench was built that we handle tokenization in the pipeline rather than in the data curation phase. It does make debugging easier, though.
That makes sense. We'll leave it as is for now.
We would greatly appreciate it if you could add our models to the leaderboard once the verification is complete.
Already ran one! May need to create a new docker image with flash attention for the gemma model, but should be done soon. Then I'll restart the leaderboard.
Models are live! Congrats!
Thank you for conducting the evaluation for us!
We would like to add two new reward models to the RewardBench leaderboard. Both models are sequence classifiers:

- Skywork/Skywork-Reward-Llama-3.1-8B
- Skywork/Skywork-Reward-Gemma-2-27B
Our local evaluation results for both models are as follows:
When we ran the evaluation script, we manually added the `attn_implementation` argument to the `model_kwargs`, because we observed issues with `gemma-2-27b-it` using the default `sdpa` attention (which causes severe performance degradation, as mentioned here). We therefore recommend this setup when verifying the performance of our models.

Both models are sequence classifiers that use Hugging Face's native implementation, so no additional code is required to implement the pipeline.
Thank you!