ZHZisZZ / modpo

[ACL'24] Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
https://arxiv.org/abs/2310.03708

Question about MODPO training with only reward models and no chosen/rejected (preference) data #3


ArvidLehmann commented 1 day ago

Dear Authors,

First of all, I want to thank you for your great work on the MODPO method and for making it available to the community. The concept is very interesting and seems highly practical for real applications. However, while trying to understand and apply it, I ran into a specific question that I could not figure out on my own.

I am trying to train a language model with MODPO in a setup where I only have two reward models and a dataset of raw prompts. There are no explicit chosen/rejected samples or preference pairs available in my data. According to the paper, this setup should be possible, but I am unsure how to configure the training process correctly.

My Questions:

  1. How can MODPO handle training when only raw prompts are given? Should the responses to the prompts be generated dynamically during training? How should the margins be calculated from the two reward models without labeled data?

  2. How are the two reward models integrated into the training process in this case? What is the correct way to set up the margin computation and loss function?

  3. Could you kindly provide an example code snippet that shows how to train MODPO in this setting? A concrete example or tutorial would be very helpful, as I am struggling with this part.

  4. Do the provided scripts (e.g., modpo.py) need to be adjusted to handle a dataset that contains only raw prompts?

Thank you very much for your time and your support. I would be very glad to receive your help and guidance on this topic.

With best regards

yifan123 commented 1 day ago

You can use Equation 19 from our paper to address scenarios where only reward models are available and no preference data exists. Below is an example implementation:

# Note: `rpo_loss` is meant to live inside the trainer class (it uses `self.beta`
# and `self.accelerator`); the imports below are what the snippet needs.
from typing import Tuple

import torch
import torch.nn.functional as F


def rpo_loss(
    self,
    policy_first_logps: torch.FloatTensor,
    policy_second_logps: torch.FloatTensor,
    reference_first_logps: torch.FloatTensor,
    reference_second_logps: torch.FloatTensor,
    first_margin_reward: torch.FloatTensor,
    second_margin_reward: torch.FloatTensor,
) -> Tuple[torch.FloatTensor, ...]:
    """Compute the RPO loss for a batch of policy and reference model log probabilities.

    Args:
        policy_first_logps: Log probabilities of the policy model for the first responses. Shape: (batch_size,)
        policy_second_logps: Log probabilities of the policy model for the second responses. Shape: (batch_size,)
        reference_first_logps: Log probabilities of the reference model for the first responses. Shape: (batch_size,)
        reference_second_logps: Log probabilities of the reference model for the second responses. Shape: (batch_size,)
        first_margin_reward: Margin reward scores for the first responses. Shape: (batch_size,)
        second_margin_reward: Margin reward scores for the second responses. Shape: (batch_size,)

    Returns:
        A tuple (losses, first_rewards, second_rewards, student_margin, teacher_margin, diff).
        The losses tensor contains the RPO loss for each example in the batch; first_rewards and
        second_rewards contain the implicit rewards for the first and second responses, respectively.
    """
    # Implicit rewards of the policy relative to the reference model.
    first_rewards = self.beta * (policy_first_logps - reference_first_logps).to(self.accelerator.device)
    second_rewards = self.beta * (policy_second_logps - reference_second_logps).to(self.accelerator.device)

    # Margin implied by the policy ("student") vs. margin given by the reward models ("teacher").
    student_margin = first_rewards - second_rewards
    teacher_margin = first_margin_reward - second_margin_reward

    diff = student_margin - teacher_margin
    losses = -F.logsigmoid(diff)

    return losses, first_rewards, second_rewards, student_margin, teacher_margin, diff

In the code above, first and second refer to the model's two responses to the same prompt. RPO stands for random preference optimization.
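
If helpful, here is a minimal, self-contained sketch (not from the repository) of how these pieces fit together at the tensor level when you start from raw prompts: sample two responses per prompt with the policy, score each (prompt, response) pair with both reward models, and feed the resulting log probabilities and margin rewards into the loss above. The names beta, w, rm1_*, rm2_* and the linear combination of the two reward models with weight w are illustrative assumptions, and random tensors stand in for quantities you would actually compute from your models:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size = 4
beta = 0.1  # illustrative KL coefficient
w = 0.5     # illustrative weight for combining the two reward models

# Stand-ins for sequence-level log probabilities of the two sampled responses
# under the policy and the frozen reference model.
policy_first_logps = torch.randn(batch_size)
policy_second_logps = torch.randn(batch_size)
reference_first_logps = torch.randn(batch_size)
reference_second_logps = torch.randn(batch_size)

# Stand-ins for the two reward models' scores on each response.
rm1_first, rm1_second = torch.randn(batch_size), torch.randn(batch_size)
rm2_first, rm2_second = torch.randn(batch_size), torch.randn(batch_size)

# One possible way to turn two reward models into a single margin reward per
# response (linear scalarization with weight w); adjust to your own objective.
first_margin_reward = w * rm1_first + (1 - w) * rm2_first
second_margin_reward = w * rm1_second + (1 - w) * rm2_second

# Same computation as rpo_loss above, written without the trainer object.
first_rewards = beta * (policy_first_logps - reference_first_logps)
second_rewards = beta * (policy_second_logps - reference_second_logps)
student_margin = first_rewards - second_rewards
teacher_margin = first_margin_reward - second_margin_reward
diff = student_margin - teacher_margin
losses = -F.logsigmoid(diff)
print(losses.mean())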

A loss with smaller variance, which we recommend using, is:

student_margin = first_rewards - second_rewards
teacher_margin = first_margin_reward - second_margin_reward
diff = student_margin - teacher_margin
losses = -F.logsigmoid(diff) - F.logsigmoid(-diff)

Eq. 19 in our paper can be viewed as a variant of the concurrently proposed regression losses [1,2], which optimize language models by regressing reward differences.
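
For comparison, a regression-style objective in that spirit replaces the logistic link with a squared error on the same quantity; this is a sketch of the general form, not the exact objectives of [1,2]:

# Regression-style variant: regress the policy's implied margin onto the
# margin given by the reward models.
losses = (student_margin - teacher_margin) ** 2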

[1] Gao, Zhaolin, et al. "REBEL: Reinforcement Learning via Regressing Relative Rewards." arXiv preprint arXiv:2404.16767 (2024).
[2] Fisch, Adam, et al. "Robust Preference Optimization through Reward Model Distillation." arXiv preprint arXiv:2405.19316 (2024).

@ZHZisZZ