vincezh2000 opened this issue 1 month ago
Yes, DPO does not leverage the absolute value of the reward, only the ranking information. This is also natural for semi-supervised learning, where we use only a hard binary signal (win vs. lose) to reduce noise.
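To make this concrete, here is a minimal sketch of the standard DPO loss (the function and variable names are illustrative, not the repo's actual code). The only preference information that enters the objective is which response is labeled as chosen; the reward scalar produced by the RM never appears:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: depends only on the chosen/rejected labels,
    not on any reward magnitude."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios
    # Minimized when the policy ranks the chosen response above the
    # rejected one; the size of the underlying reward gap is irrelevant.
    return -F.logsigmoid(beta * logits).mean()

# Toy usage: rescaling the rewards that produced these labels
# would not change this loss at all.
chosen = torch.tensor([-12.3]); rejected = torch.tensor([-15.9])
ref_c = torch.tensor([-12.0]); ref_r = torch.tensor([-15.0])
print(dpo_loss(chosen, rejected, ref_c, ref_r))
```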
In the RLHF Workflow paper, the reward model is used to annotate new data generated by the LLM during the iterative DPO process, producing scalar values. According to Algorithm 1, the traditional RM+RLHF pipeline incorporates these scalar values into the loss function, so a reward of r = 8 versus r = 80 leads to different updates.
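For contrast, a schematic REINFORCE-style objective (purely illustrative, not the paper's Algorithm 1) shows how a scalar reward magnitude would scale the update directly:

```python
import torch

def reinforce_loss(seq_logps, rewards, baseline=0.0):
    # The scalar reward enters the gradient multiplicatively:
    # r = 8 and r = 80 produce updates of different sizes,
    # unlike a ranking-only objective such as DPO.
    advantages = rewards - baseline
    return -(advantages * seq_logps).mean()
```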
However, with the DPO method, the training objective is the DPO loss, which does not explicitly use the reward scalar. It seems the only information used is the preference, i.e. that A is preferred over B. The paper does not give specific details on how this is handled.
My question is: if we use the ArmoRM model for training with the iterative DPO method, will it still only use the information about which score is higher, rather than the actual scalar reward values? Is using it merely to label the preference pairs sufficient to fully utilize the multi-objective RM?
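For reference, the labeling step in question would look roughly like the sketch below (`rm_score` stands in for whatever scoring interface ArmoRM exposes; it is a placeholder, not the actual API). The scalar scores are collapsed into a binary chosen/rejected label, so only the ranking survives and the score margin is discarded:

```python
def label_preference_pair(prompt, response_a, response_b, rm_score):
    # rm_score is a hypothetical callable returning a scalar score.
    # Only the comparison of the two scores is kept; the gap between
    # them (e.g. 8 vs. 80) does not enter the DPO training data.
    score_a = rm_score(prompt, response_a)
    score_b = rm_score(prompt, response_b)
    if score_a >= score_b:
        return {"chosen": response_a, "rejected": response_b}
    return {"chosen": response_b, "rejected": response_a}
```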