allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0

adding Archangel models (dpo, kto, sft+dpo, sft+kto) #84

Closed by kawine 3 months ago

kawine commented 3 months ago

The Archangel suite contains DPO, SFT+DPO, KTO, and SFT+KTO models, which can also be used as reward models: https://huggingface.co/collections/ContextualAI/archangel-65bd45029fa020161b052430

For each method, there are seven models available: pythia-{1.4, 2.8, 6.9, 12.0}B and llama-{7, 13, 30}B, all of which have been aligned under nearly identical settings on {Anthropic HH, Open Assistant, SHP 1.0} data.

The implied reward for both DPO- and KTO-aligned models is $\beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$, where $\pi_{\text{ref}}$ is the reference model.
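
For concreteness, here is a minimal sketch (not code from the Archangel release) of how that implied reward could be scored with Hugging Face `transformers`. The model IDs, the reference model, and the `BETA` value are placeholders; substitute the actual Archangel policy, its reference model from the table below, and the beta used during training.

```python
# Sketch: implied reward = beta * (log pi_theta(y|x) - log pi_ref(y|x)).
# Model names and beta are placeholders, not the actual Archangel settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

POLICY_NAME = "ContextualAI/archangel_dpo_pythia1-4b"  # placeholder: an Archangel policy
REF_NAME = "EleutherAI/pythia-1.4b"                    # placeholder: its reference model
BETA = 0.1                                             # placeholder: beta used in training

tokenizer = AutoTokenizer.from_pretrained(POLICY_NAME)
policy = AutoModelForCausalLM.from_pretrained(POLICY_NAME).eval()
ref = AutoModelForCausalLM.from_pretrained(REF_NAME).eval()

@torch.no_grad()
def sum_logprob(model, prompt: str, completion: str) -> float:
    """Sum of log p(y_t | x, y_<t) over the completion tokens only.

    Note: tokenizing prompt and prompt+completion separately is a
    simplification; boundary tokens can differ for some tokenizers.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    logits = model(full_ids).logits[:, :-1, :]       # position i predicts token i+1
    targets = full_ids[:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)
    ).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return logprobs[:, n_prompt - 1:].sum().item()   # keep only completion positions

def implied_reward(prompt: str, completion: str) -> float:
    """Higher means the aligned model prefers this completion."""
    return BETA * (
        sum_logprob(policy, prompt, completion) - sum_logprob(ref, prompt, completion)
    )

print(implied_reward("What is the capital of France?\n", "Paris."))
```

To rank two completions of the same prompt as a reward model would, compare their `implied_reward` values; the common `BETA` factor does not change the ordering, but it matters if the scores are compared across models.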

The reference model for each set of models in Archangel is as follows: