Aligner2024 / aligner

Achieving Efficient Alignment through Learned Correction
https://aligner2024.github.io/

question about the paper #6

Closed · Ruibn closed this issue 3 weeks ago

Ruibn commented 2 months ago

Hi @Aligner2024 ,

May I know how the harmlessness and helpfulness scores in Figure 2 are calculated? I also noticed that you changed Equation (2); may I know the reason?

Also, the code here only covers Aligner training. Will you share the evaluation code later so that we can easily reproduce the evaluation setup in the paper?

Finally, in the integrated testing, what prompt does Aligner use to produce the corrected answer?

Thanks

Aligner2024 commented 1 month ago

Hi @Ruibn,

First of all, we sincerely apologize for the delay in responding to your questions. Because our email address is anonymized, we did not receive the notification in time. We will now answer your questions in detail.

Regarding the calculation of the helpfulness and harmlessness scores in the figure: following the SafeRLHF method, we use publicly available preference datasets to train a reward model and a cost model, which measure helpfulness and harmlessness, respectively. This evaluation method is widely used in practice, for example in the Llama 2 family of models and in SafeRLHF.
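As a rough, hypothetical sketch of this kind of scoring: the model paths, the sequence-classification loading path, and the sign convention for the cost model below are placeholders for illustration, not our exact setup.

```python
# Hypothetical sketch: scoring an answer with separately trained reward (helpfulness)
# and cost (harmlessness) models. Model paths and loading details are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

REWARD_MODEL = "path/to/reward-model"   # placeholder: helpfulness scorer
COST_MODEL = "path/to/cost-model"       # placeholder: harmlessness scorer


def score(model_name: str, question: str, answer: str) -> float:
    """Return a scalar score for a (question, answer) pair from a single-output scorer."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
    model.eval()
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()


question = "How do I stay safe while hiking alone?"
answer = "Tell someone your route, carry a map, and check the weather."
helpfulness = score(REWARD_MODEL, question, answer)
harmlessness = -score(COST_MODEL, question, answer)  # assumption: lower cost = more harmless
print(f"helpfulness={helpfulness:.3f}, harmlessness={harmlessness:.3f}")
```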

Concerning the evaluation code, we apologize for omitting this part and will add it. Specifically, we first use vLLM to generate the upstream models' answers, which are then corrected by Aligner. Finally, GPT-4 and human evaluators score the answers. Further details about the evaluation process can be found in Appendix C, and we will release our code soon.
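For the first step, a minimal sketch of generating upstream answers with vLLM looks roughly like the following; the upstream model name and sampling settings are illustrative, not the paper's exact configuration.

```python
# Sketch of step 1: generating upstream-model answers with vLLM.
# Model name and sampling parameters are placeholders.
from vllm import LLM, SamplingParams

upstream = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # placeholder upstream model
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

questions = ["How can I improve my sleep schedule?"]
outputs = upstream.generate(questions, sampling)
upstream_answers = [o.outputs[0].text for o in outputs]
# These (question, answer) pairs are then corrected by Aligner (see the prompt below)
# and finally scored by GPT-4 and human evaluators.
```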

In actual practice, the prompt for Aligner is as follows:

BEGINNING OF CONVERSATION: USER: Question:{question} Answer:{answer} Revision: ASSISTANT:

It is important to note that this prompt is identical to the one used in the final training script, which ensures consistency and unbiased results.
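For illustration, applying this prompt at inference time might look like the sketch below; the checkpoint path and generation settings are placeholders for whichever Aligner model you trained or downloaded.

```python
# Sketch: correcting an upstream answer with an Aligner checkpoint using the prompt above.
# The checkpoint path and generation settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = ("BEGINNING OF CONVERSATION: USER: Question:{question} "
          "Answer:{answer} Revision: ASSISTANT:")

model_path = "path/to/aligner-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()

question = "How can I improve my sleep schedule?"
answer = "Just take sleeping pills every night."  # upstream model's answer
inputs = tokenizer(PROMPT.format(question=question, answer=answer), return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, i.e. the corrected answer.
correction = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(correction)
```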

Once again, we sincerely apologize for the delay in our response and hope that our answers are helpful to you.