IBM / SALMON

Self-Alignment with Principle-Following Reward Models
https://arxiv.org/abs/2310.05910
GNU General Public License v3.0

A question about the paper #4

Open richhh520 opened 4 months ago

richhh520 commented 4 months ago

Dear author,

Why "we use a set of principles different from the reward model training stage, as illustrated in Table 8, which contains a few more principles that we would expect a well-aligned LLM AI-assistant agent would behave." ? Thanks for your explanation!

Edward-Sun commented 4 months ago

Hi @richhh520 ,

After the instructable reward model is trained, we expect it to follow arbitrary principles. Therefore, when we find issues during the RL phase, we can directly prompt the instructable RM with different principles to steer its preferences, and thereby the RL model's behavior.
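
For illustration, here is a minimal sketch of what "prompting the instructable RM with a different principle set at RL time" could look like. This is not the repository's actual API: the principle strings, `build_rm_prompt`, and the `score_fn` callable are all hypothetical placeholders standing in for the trained principle-following reward model.

```python
# Hypothetical sketch: conditioning a principle-following reward model
# on a different principle set at RL time. All names below are
# placeholders, not the SALMON codebase's actual API.

RM_TRAINING_PRINCIPLES = [
    "1 (Honest). The response should be truthful and acknowledge uncertainty.",
    "2 (Helpful). The response should address the user's request directly.",
]

# Extra principles added only at the RL phase (cf. Table 8 in the paper),
# used to steer the RM's preferences when issues are observed.
RL_PHASE_PRINCIPLES = RM_TRAINING_PRINCIPLES + [
    "3 (Concise). The response should avoid unnecessary repetition.",
]


def build_rm_prompt(principles, user_query, response):
    """Format the principle list, the query, and a candidate response into
    the text that the instructable reward model scores."""
    principle_block = "\n".join(principles)
    return (
        f"Principles:\n{principle_block}\n\n"
        f"User: {user_query}\n"
        f"Assistant: {response}\n"
        "Score the response according to the principles above."
    )


def preference(score_fn, principles, user_query, resp_a, resp_b):
    """Return the response preferred under the given principle set.
    `score_fn` is a stand-in for the trained reward model's scoring call."""
    score_a = score_fn(build_rm_prompt(principles, user_query, resp_a))
    score_b = score_fn(build_rm_prompt(principles, user_query, resp_b))
    return resp_a if score_a >= score_b else resp_b
```

Under this sketch, swapping `RM_TRAINING_PRINCIPLES` for `RL_PHASE_PRINCIPLES` changes only the prompt fed to the same trained RM, which is the sense in which the principle set can differ between the reward model training stage and the RL stage.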