allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

Clarification Needed on DPO Reward Evaluation #116

Closed ZHZisZZ closed 2 months ago

ZHZisZZ commented 2 months ago

Thank you for providing such a valuable benchmark.

I am seeking clarification on the model/reference specifications for DPO rewards, which are not readily apparent in either the paper or the leaderboard. For example, it is unclear whether models like Llama-3-8B-Instruct, Qwen, and Zephyr were evaluated with or without a reference model. If references were used, could you please provide guidance on how to access the reference models?
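For concreteness, here is my rough sketch of the two scoring modes I am asking about (my own illustration, not the repo's code), using the standard DPO implicit reward: with a reference, the score is β(log π_policy(y|x) − log π_ref(y|x)); reference-free, the reference term is dropped. Model paths below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM

# My own rough sketch, not RewardBench's actual code; model paths are placeholders.
policy = AutoModelForCausalLM.from_pretrained("path/to/dpo-tuned-model")
ref = AutoModelForCausalLM.from_pretrained("path/to/reference-model")

def sum_logprob(model, prompt_ids, response_ids):
    """Sum of log-probs the model assigns to the response tokens, given the prompt."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position t predicts token t+1, so drop the last position and keep
    # only the positions that predict response tokens.
    logprobs = logits[:, :-1].log_softmax(dim=-1)
    response_positions = logprobs[:, prompt_ids.shape[-1] - 1 :]
    return torch.gather(response_positions, 2, response_ids.unsqueeze(-1)).sum()

def dpo_reward(prompt_ids, response_ids, beta=1.0, use_ref=True):
    """With reference: beta * (log pi_policy - log pi_ref); reference-free: drop the ratio."""
    score = sum_logprob(policy, prompt_ids, response_ids)
    if use_ref:
        score = score - sum_logprob(ref, prompt_ids, response_ids)
    return beta * score

# A pair is scored correctly when dpo_reward(prompt, chosen) > dpo_reward(prompt, rejected).
```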

Thank you for your assistance.

natolambert commented 2 months ago

Hey! @ZHZisZZ, good point, let's improve the docs. I committed an improvement to the leaderboard just now. TLDR:

  1. All models on the leaderboard are computed with a reference model. Reference-free results are currently only in an upcoming version of the paper (not even updated on arXiv yet).
  2. The reference models are all in the RewardBench results repo. Specifically, if you click on a specific model's result file you'll see a key `ref_model` (e.g. Qwen). Alternatively, all the reference models are listed in the evaluation configs.

If you go to a model's folder, you can see that the reference-free results are stored separately.
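As a rough sketch (the repo id and file path below are illustrative, so check the results dataset for the exact layout), you can pull a single result file and read its `ref_model` key like this:

```python
# Illustrative only: repo id and filename may not match the actual layout.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="allenai/reward-bench-results",          # results dataset on the Hub
    filename="eval-set/Qwen/Qwen1.5-72B-Chat.json",  # example path, may differ
    repo_type="dataset",
)
with open(path) as f:
    result = json.load(f)
print(result.get("ref_model"))  # the reference model used for DPO scoring
```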

natolambert commented 2 months ago

The reference-free results can also be loaded with this script: https://github.com/allenai/reward-bench/blob/main/analysis/get_dpo_ref_free_results.py

ZHZisZZ commented 2 months ago

Thank you for your clarifications. I noticed that some of the reference models are pre-trained base models, which may not be the actual reference models used during training. This can cause issues because these pre-trained models do not always recognize the prompt template of the fine-tuned models.
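To illustrate what I mean (the model names are just an example pairing): the DPO-tuned model's chat template inserts role markers that the pre-trained base model was never trained on, so the reference log-probabilities over those template tokens may not be meaningful.

```python
# Example of the template mismatch (model names are just an illustrative pair).
from transformers import AutoTokenizer

dpo_tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # DPO-tuned model
base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")    # pre-trained base

messages = [{"role": "user", "content": "What is RewardBench?"}]

# The fine-tuned tokenizer has a chat template with special role markers:
print(dpo_tok.apply_chat_template(messages, tokenize=False))

# The base model can still tokenize that formatted string, but it was never trained
# on the template, so the likelihoods it assigns as a reference may be unreliable.
print(base_tok("What is RewardBench?").input_ids)
```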

And, out of curiosity, I have some extra questions:

1) Based on your experience, do these pre-trained reference models offer advantages over the reference-free version?

2) If so, could we then use an arbitrary pre-trained model as the reference, as long as it shares the same vocabulary as the fine-tuned model (although page 11 of the paper notes that using the wrong reference may lead to poor performance)?

natolambert commented 2 months ago

@ZHZisZZ I think it'll be a mixed bag. You'll probably gain 1-5% performance by using the SFT checkpoint vs. the base model. Some models aren't well documented and may simply not have an SFT checkpoint.

We did (accidentally) try swapping in the wrong reference model, and performance on a few models regressed entirely to random.