allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0
442 stars 52 forks

adding kto as a separate category #100

Closed kawine closed 7 months ago

kawine commented 7 months ago

can KTO be added as a separate model type on the leaderboard?

natolambert commented 7 months ago

My take is that because it's run exactly the same way as a DPO model, it's worth adding to the documentation, but I'm not sure it's worth a separate category yet. If they start to diverge, we can revisit it in the future.

kawine commented 7 months ago

the implied reward is upstream of the DPO objective, right? like you can calculate this with any pair of {base, finetuned/aligned models}, even ones finetuned with just SFT, and it should work (maybe not as well)

i would call it "implied RLHF reward" or something, since technically it predates the dpo work
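A minimal sketch of the implied reward described above, assuming the standard log-ratio form r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)). The function names, the `beta` default, and the per-token log-probability inputs are hypothetical and chosen for illustration; this is not RewardBench's actual code.

```python
from typing import Sequence


def implied_reward(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Implied reward of one completion given per-token log-probs.

    policy_logprobs: log-probs of the completion tokens under the
        finetuned/aligned model (DPO, KTO, SFT, ...).
    ref_logprobs: log-probs of the same tokens under the base model.
    beta: scale factor (an assumption; it does not change the ranking).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both models must score the same tokens")
    # Sum of token log-probs gives the sequence log-prob; the reward is
    # the scaled log-ratio between the two models.
    return beta * (sum(policy_logprobs) - sum(ref_logprobs))


def prefers_chosen(
    chosen_policy: Sequence[float],
    chosen_ref: Sequence[float],
    rejected_policy: Sequence[float],
    rejected_ref: Sequence[float],
    beta: float = 1.0,
) -> bool:
    """True if the chosen completion gets the higher implied reward."""
    chosen = implied_reward(chosen_policy, chosen_ref, beta)
    rejected = implied_reward(rejected_policy, rejected_ref, beta)
    return chosen > rejected
```

Nothing here depends on how the finetuned model was trained, which is why the same scoring path would apply to a DPO, KTO, or SFT checkpoint paired with its base model.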

natolambert commented 7 months ago

Interesting, do you have a link @kawine? I'd like to read some more of this history. But yeah, it's colloquially recognized as DPO, though it seems like that's rewriting history then.

kawine commented 7 months ago

i think the DPO paper cites a few different papers as its inspiration in this respect, but the most clear precedent imo was set in a paper by Korbak et al (thm 1): https://arxiv.org/abs/2206.00761