allenai / reward-bench

RewardBench: the first evaluation tool for reward models.
https://huggingface.co/spaces/allenai/reward-bench
Apache License 2.0
277 stars 27 forks

[Add Model] Pairwise Preference Model #123

Closed WeiXiongUST closed 1 month ago

WeiXiongUST commented 1 month ago

Could you help to add the new pairwise preference model RLHFlow/pair-preference-model-LLaMA3-8B?

The usage of the model is similar to pairRM: we input a prompt and two responses, and the model returns the probability that the first response is preferred. I tried to implement a pipeline in rewardbench/models/pairpm.py and also attach an example of using the model for your reference. I am wondering how we should merge such a customized model into RewardBench. Many thanks in advance!
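A minimal sketch of the scoring step such a pipeline performs (the function names and the choice-token prompt framing below are illustrative assumptions, not the actual pairpm.py API): the model emits logits for two choice tokens, and a softmax over that pair yields the probability that the first response is preferred.

```python
import math

def preference_prob(logit_a: float, logit_b: float) -> float:
    """Probability that response A is preferred over response B,
    computed as a softmax over the model's logits for the two
    choice tokens."""
    return math.exp(logit_a) / (math.exp(logit_a) + math.exp(logit_b))

def format_pair(prompt: str, resp_a: str, resp_b: str) -> str:
    # Hypothetical prompt framing: present both candidates and ask
    # the model to pick A or B. Real chat templates vary per model.
    return (
        f"[CONTEXT] {prompt}\n"
        f"[RESPONSE A] {resp_a}\n"
        f"[RESPONSE B] {resp_b}\n"
        "Which response is better, A or B?"
    )
```

With equal logits the probability is 0.5; a larger logit for the first choice token means the first response is preferred.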

The benchmark results are as follows.

  df_acc = pd.concat([df_acc, pd.DataFrame(row)], ignore_index=True)
     category                 subset  accuracy      n
0        chat        alpacaeval-easy  0.980000  100.0
1        chat      alpacaeval-length  0.978947   95.0
2        chat        alpacaeval-hard  0.989474   95.0
3        chat          mt-bench-easy  1.000000   28.0
4        chat           mt-bench-med  1.000000   40.0
5   chat-hard          mt-bench-hard  0.756757   37.0
6   chat-hard         llmbar-natural  0.900000  100.0
7   chat-hard  llmbar-adver-neighbor  0.522388  134.0
8   chat-hard   llmbar-adver-GPTInst  0.619565   92.0
9   chat-hard    llmbar-adver-GPTOut  0.680851   47.0
10  chat-hard    llmbar-adver-manual  0.500000   46.0
11     safety     refusals-dangerous  0.930000  100.0
12     safety     refusals-offensive  0.970000  100.0
13     safety   xstest-should-refuse  0.954545  154.0
14     safety  xstest-should-respond  0.968000  250.0
15     safety            donotanswer  0.625000  136.0
16  reasoning               math-prm  0.948546  447.0
17  reasoning                hep-cpp  0.939024  164.0
18  reasoning                 hep-go  0.945122  164.0
19  reasoning               hep-java  0.975610  164.0
20  reasoning                 hep-js  0.951220  164.0
21  reasoning             hep-python  0.975610  164.0
22  reasoning               hep-rust  0.914634  164.0
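To turn per-subset accuracies like those above into per-category scores, one natural approach is a sample-weighted mean (total correct over total examples). The official RewardBench leaderboard may weight subsets differently; this is a sketch over a few rows reproduced from the table above.

```python
import pandas as pd

# A few rows from the results table above.
df = pd.DataFrame({
    "category": ["chat", "chat", "chat-hard", "chat-hard"],
    "subset": ["alpacaeval-easy", "alpacaeval-length",
               "llmbar-natural", "llmbar-adver-manual"],
    "accuracy": [0.980000, 0.978947, 0.900000, 0.500000],
    "n": [100.0, 95.0, 100.0, 46.0],
})

def category_scores(df: pd.DataFrame) -> pd.Series:
    """Sample-weighted accuracy per category:
    sum(accuracy * n) / sum(n) within each category."""
    correct = df["accuracy"] * df["n"]
    return correct.groupby(df["category"]).sum() / df["n"].groupby(df["category"]).sum()
```

For the chat rows above this gives (0.98*100 + 0.978947*95) / 195 ≈ 0.979, while an unweighted mean of subset accuracies would give a slightly different number.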
WeiXiongUST commented 1 month ago

While this preference model also performs pairwise comparison, its training and usage are quite different from pairRM. I think we can refer to it as slicpairpm, since it is most similar to SLiC-HF: Sequence Likelihood Calibration with Human Feedback.

natolambert commented 1 month ago

@WeiXiongUST just need to run the following (I think)

make style
make quality
WeiXiongUST commented 1 month ago

@WeiXiongUST just need to run the following (I think)

make style
make quality

Have tested these two commands locally!