Thanks for the kind words. In the equation you posted, the probability is the probability of the entire response sequence, conditioned on the input. The per-token logprobs need to be summed (along the sequence length dimension) to get the total log probability of the chosen/rejected sequences. The sum is implicit in the equation (i.e., the total log probabilities are the sum of the per-token log probabilities).
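For concreteness, here is a minimal sketch of that reduction in PyTorch (the function and argument names here are hypothetical, not the repo's actual code). The per-token log probabilities of the target tokens are gathered and then summed over the sequence length dimension, with a mask so prompt and padding tokens don't contribute:

```python
import torch

def sequence_logps(logits: torch.Tensor, labels: torch.Tensor,
                   loss_mask: torch.Tensor) -> torch.Tensor:
    """Reduce per-token log probabilities to one log probability per sequence.

    Assumes logits and labels are already aligned (labels shifted so that
    position t of the logits predicts token t of the labels).

    logits:    (batch, seq_len, vocab) policy logits at each position
    labels:    (batch, seq_len) target token ids
    loss_mask: (batch, seq_len) 1.0 for response tokens, 0.0 for prompt/padding
    """
    # Log probabilities over the vocabulary at every position
    all_logps = logits.log_softmax(-1)
    # Log probability of the token that actually occurs at each position
    per_token_logps = torch.gather(all_logps, 2, labels.unsqueeze(2)).squeeze(2)
    # Sum along the sequence length dimension: log pi(y|x) = sum_t log pi(y_t | x, y_<t)
    return (per_token_logps * loss_mask).sum(-1)
```

Note that the mask zeroes out prompt and padding positions before the sum, so chosen and rejected responses of different lengths are handled correctly.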
Feel free to re-open if this didn't answer your question!
Hi @eric-mitchell, should the variable lengths of the chosen/rejected sequences be taken into account in the loss? Any comments on this are highly appreciated.
Hi @eric-mitchell, In your formula (the image below), it seems that log[π(y|x)] was calculated via .sum(-1) after logits.softmax(-1) followed by .log(). But in your code (the image below), log[π(y|x)] is calculated via .sum(-1) after logits.log_softmax(-1).
The two ways of calculating log[π(y|x)] seem different. Could you please tell me whether they conflict with each other?
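For reference, a quick numerical check (a sketch, assuming PyTorch) that the two orderings compute the same quantity; log_softmax is just the fused, numerically stabler form of softmax followed by log:

```python
import torch

logits = torch.randn(2, 5, 100)  # (batch, seq_len, vocab)

a = logits.log_softmax(-1)    # fused log-softmax (numerically stable)
b = logits.softmax(-1).log()  # softmax first, then log

print(torch.allclose(a, b, atol=1e-6))  # True, up to floating-point error
```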
First of all, I'd like to express my gratitude for the amazing work done by the author. While going through the DPO code, I came across a point of confusion: I understand that π(y|x) represents the probability distribution when the model generates a response, but the code implementation has an additional sum(-1) operation. However, I don't see a summation operation in the formula, as shown in the image below:
Could you please help me understand the logic behind this implementation? Thank you!
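For context on where the sum comes from: the sequence probability factorizes autoregressively, so the total log probability is the sum of the per-token log probabilities,

$$\log \pi(y \mid x) = \sum_{t=1}^{|y|} \log \pi\left(y_t \mid x, y_{<t}\right).$$

The model's logits give the per-token terms, and the .sum(-1) in the code performs the outer sum over the sequence length.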