Aligner2024 / aligner

Achieving Efficient Alignment through Learned Correction
https://aligner2024.github.io/

question about the paper #6

Closed · Ruibn closed this issue 3 weeks ago

Ruibn commented 2 months ago

Hi @Aligner2024 ,

May I know how the harmlessness and helpfulness scores in Figure 2 are calculated? I also noticed that you changed Equation (2); may I know the reason?

Also, the code here only covers Aligner training. Will you share the evaluation code later so that we can easily reproduce the evaluation setup in the paper?

Finally, in the integrated testing, what prompt does Aligner use to produce the corrected answer?

Thanks

Aligner2024 commented 1 month ago

Hi @Ruibn,

First of all, we sincerely apologize for the delay in responding to your questions. Because our email address is anonymized, we did not receive the notification in time. We will now answer your questions in detail.

Regarding the calculation of the helpfulness and harmlessness scores in the figure: following the SafeRLHF method, we use publicly available preference datasets to train a reward model and a cost model, which measure helpfulness and harmlessness, respectively. This evaluation method is widely used in practice, for example in the Llama 2 family of models and in SafeRLHF.
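As a rough, hypothetical sketch of this kind of scoring: the model paths, the sequence-classification loading path, and the sign convention for the cost model below are placeholders for illustration, not our exact setup.

```python
# Hypothetical sketch: scoring an answer with separately trained reward (helpfulness)
# and cost (harmlessness) models. Model paths and loading details are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

REWARD_MODEL = "path/to/reward-model"   # placeholder: helpfulness scorer
COST_MODEL = "path/to/cost-model"       # placeholder: harmlessness scorer


def score(model_name: str, question: str, answer: str) -> float:
    """Return a scalar score for a (question, answer) pair from a single-output scorer."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
    model.eval()
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()


question = "How do I stay safe while hiking alone?"
answer = "Tell someone your route, carry a map, and check the weather."
helpfulness = score(REWARD_MODEL, question, answer)
harmlessness = -score(COST_MODEL, question, answer)  # assumption: lower cost = more harmless
print(f"helpfulness={helpfulness:.3f}, harmlessness={harmlessness:.3f}")
```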

Concerning the evaluation code, we apologize for omitting this part and will add it. Specifically, we first use vLLM to generate the upstream models' answers, which are then corrected by Aligner. Finally, GPT-4 and human evaluators score the answers. Further details about the evaluation process can be found in Appendix C, and we will release our code soon.
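For the first step, a minimal sketch of generating upstream answers with vLLM looks roughly like the following; the upstream model name and sampling settings are illustrative, not the paper's exact configuration.

```python
# Sketch of step 1: generating upstream-model answers with vLLM.
# Model name and sampling parameters are placeholders.
from vllm import LLM, SamplingParams

upstream = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # placeholder upstream model
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

questions = ["How can I improve my sleep schedule?"]
outputs = upstream.generate(questions, sampling)
upstream_answers = [o.outputs[0].text for o in outputs]
# These (question, answer) pairs are then corrected by Aligner (see the prompt below)
# and finally scored by GPT-4 and human evaluators.
```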

In actual practice, the prompt for Aligner is as follows:

BEGINNING OF CONVERSATION: USER: Question:{question} Answer:{answer} Revision: ASSISTANT:

It is important to note that this prompt is identical to the one used in the final training script, which ensures consistency and unbiased results.
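For illustration, applying this prompt at inference time might look like the sketch below; the checkpoint path and generation settings are placeholders for whichever Aligner model you trained or downloaded.

```python
# Sketch: correcting an upstream answer with an Aligner checkpoint using the prompt above.
# The checkpoint path and generation settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = ("BEGINNING OF CONVERSATION: USER: Question:{question} "
          "Answer:{answer} Revision: ASSISTANT:")

model_path = "path/to/aligner-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()

question = "How can I improve my sleep schedule?"
answer = "Just take sleeping pills every night."  # upstream model's answer
inputs = tokenizer(PROMPT.format(question=question, answer=answer), return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, i.e. the corrected answer.
correction = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(correction)
```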

Once again, we sincerely apologize for the delay in our response and hope that our answers are helpful to you.