Aligner2024 / aligner

Achieving Efficient Alignment through Learned Correction
https://aligner2024.github.io/

How to evaluate on BeaverTails and HarmfulQA #2

Closed richhh520 closed 5 months ago

richhh520 commented 6 months ago

As in the title. Thanks very much!

Aligner2024 commented 6 months ago

Hi, @richhh520,

First of all, we would like to sincerely apologize for not answering your question in a timely manner: because our email address is anonymous, we did not receive the notification in time. We answer your question in detail below.

As mentioned in our paper [https://aligner2024.github.io/materials/Aligner_anonymous.pdf], we extracted prompts from HarmfulQA and BeaverTails (randomly selecting 700 from each; see Appendix E.1 for details).
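As a rough illustration (this is a minimal sketch, not our exact evaluation script; the Hugging Face dataset IDs, split names, and column names below are assumptions for illustration), the prompt sampling could look like this:

```python
import random

from datasets import load_dataset

random.seed(42)  # illustrative fixed seed for reproducibility

# NOTE: dataset IDs, splits, and column names are assumptions, not from our code.
beavertails = load_dataset("PKU-Alignment/BeaverTails", split="30k_test")
harmfulqa = load_dataset("declare-lab/HarmfulQA", split="train")

# Deduplicate prompts, then randomly select 700 from each dataset.
bt_prompts = list({row["prompt"] for row in beavertails})
hq_prompts = list({row["question"] for row in harmfulqa})

eval_prompts = random.sample(bt_prompts, 700) + random.sample(hq_prompts, 700)
print(len(eval_prompts))  # 1400 evaluation prompts in total
```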

Subsequently, we used 11 models (covering safety-aligned, non-safety-aligned, open-source, and API-based models) to generate original answers. These answers were then corrected by a single Aligner model to produce the corrected answers; a sketch of this step follows below.
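The correction step can be sketched as follows, assuming a HuggingFace-style causal LM checkpoint for the Aligner. The checkpoint name and the prompt template here are placeholders for illustration; please follow the README and the paper for the exact ones:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; use the released Aligner weights in practice.
ALIGNER = "aligner/aligner-7b-v1.0"

tokenizer = AutoTokenizer.from_pretrained(ALIGNER)
model = AutoModelForCausalLM.from_pretrained(
    ALIGNER, torch_dtype=torch.bfloat16, device_map="auto"
)

def correct(question: str, original_answer: str) -> str:
    # Illustrative template: the Aligner conditions on the question and the
    # upstream model's original answer, and generates a corrected answer.
    prompt = (
        "BEGINNING OF CONVERSATION: USER: "
        "Edit the following Question-Answer pair to make it more helpful and harmless: "
        f"{question} | {original_answer} ASSISTANT:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens and return only the generated correction.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```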

Finally, we employed GPT-4 to judge the preference between the original and corrected answers in terms of helpfulness and harmlessness, calculating the win rate with Eq. (4) in Appendix E.2 (the number of wins minus the number of losses, divided by the total number of comparisons).
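Concretely, the win rate reduces to (wins - losses) / total; a minimal sketch (assuming ties count toward the total, as in the description above):

```python
def win_rate(wins: int, ties: int, losses: int) -> float:
    """Win rate as in Eq. (4): (wins - losses) / total, where total includes ties."""
    total = wins + ties + losses
    return (wins - losses) / total

# Example: 450 wins, 150 ties, 100 losses out of 700 comparisons -> 0.5 win rate
print(win_rate(450, 150, 100))  # 0.5
```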

For detailed evaluation prompts, please see Appendix E.3.

Once again, we sincerely apologize for the delayed reply and hope our answer is helpful to you.