Open richhh520 opened 4 months ago
Dear author,
Why do you state that "we use a set of principles different from the reward model training stage, as illustrated in Table 8, which contains a few more principles that we would expect a well-aligned LLM AI-assistant agent would behave"? Thanks for your explanation!
Hi @richhh520,
Once the instructable reward model is trained, we expect it to follow arbitrary principles. Therefore, when we find issues during the RL phase, we can directly prompt the instructable RM with different principles to steer its preference, and with it the RL model's behavior.