Closed: kawine closed this 7 months ago
My take is that because it's run exactly the same way as a DPO model, it's worth adding to the documentation, but I'm not sure it's worth a separate distinction yet. If they start to diverge, we can revisit it in the future.
The implied reward is upstream of the DPO objective, right? You can calculate it with any pair of {base, finetuned/aligned} models, even ones finetuned with just SFT, and it should work (maybe not as well) -- see the sketch below for what I mean.
I would call it "implied RLHF reward" or something, since technically it predates the DPO work.
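For concreteness, here's a minimal sketch of how you could compute that implied reward for any (policy, reference) pair; the model names, helper, and beta below are placeholders for illustration, assuming HF-style causal LMs:

```python
# Sketch only: score a completion as beta * (logp under policy - logp under reference),
# i.e. the implied reward r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt, completion):
    """Sum of log-probs of the completion tokens, conditioned on the prompt."""
    ids = tokenizer(prompt + completion, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**ids).logits                      # [1, T, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # position t predicts token t+1
    targets = ids["input_ids"][:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()      # completion tokens only

def implied_reward(policy, reference, tokenizer, prompt, completion, beta=0.1):
    lp_policy = completion_logprob(policy, tokenizer, prompt, completion)
    lp_ref = completion_logprob(reference, tokenizer, prompt, completion)
    return beta * (lp_policy - lp_ref)

# Usage (placeholder checkpoints): the "policy" can be any finetuned/aligned model
# (DPO, KTO, or even plain SFT) and the "reference" its base model.
# policy = AutoModelForCausalLM.from_pretrained("org/aligned-model")
# reference = AutoModelForCausalLM.from_pretrained("org/base-model")
# tokenizer = AutoTokenizer.from_pretrained("org/base-model")
# print(implied_reward(policy, reference, tokenizer, "Question: ...\n", "Answer: ..."))
```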
Interesting, do you have a link @kawine? I'd like to read some more of this history. But yeah, it's colloquially recognized as DPO, so it does seem like rewriting history then.
I think the DPO paper cites a few different papers as its inspiration in this respect, but the clearest precedent IMO was set in a paper by Korbak et al. (Thm. 1): https://arxiv.org/abs/2206.00761
Can KTO be added as a separate model type on the leaderboard?