Open LSX-Sneakerprogrammer opened 1 year ago
Hi, I see there is a bool variable in _get_batch_logps of trainers.py that controls whether to compute the average log probability or not, and I have two questions. Hope you can help me with these, thanks a lot!

- If I choose to get the average log probability, I assume the pad_to_length function needs to be turned off. Is that right?
No. Padded tokens will not be counted. See here: https://github.com/eric-mitchell/direct-preference-optimization/blob/main/trainers.py#L96C5-L104
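Roughly, the relevant logic looks like this (a paraphrased sketch, not a verbatim copy of the repo, so treat details as approximate): padded label positions are set to -100 and masked out before either summing or averaging, so padding never contributes to the per-sequence log probability.

```python
import torch

def get_batch_logps_sketch(logits, labels, average_log_prob=False):
    """Per-sequence log prob of `labels` under `logits`.
    Sketch of _get_batch_logps; padded positions carry label -100."""
    # Shift so that the token at position t is predicted from logits at t-1.
    labels = labels[:, 1:].clone()
    logits = logits[:, :-1, :]
    # Padded labels are -100 and must not be counted.
    loss_mask = labels != -100
    labels[labels == -100] = 0  # dummy index so gather stays valid
    per_token_logps = torch.gather(
        logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)
    if average_log_prob:
        # Mean over non-padded tokens only.
        return (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)
    # Sum over non-padded tokens only.
    return (per_token_logps * loss_mask).sum(-1)
```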
- Did you do experiments on this to see which one performs better?
~~I have already tried specifying average_log_prob=True, but the beta value needs adjustment. For example, if the sum of log probabilities is roughly divided by an average length of 100 tokens, then beta needs to be increased by about 100 times.~~
average_log_prob=False is better
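To make the scaling point concrete (a rough sketch, assuming an average response length of about $L = 100$ tokens), the DPO loss is

$$
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right]\right).
$$

With average_log_prob=True each log probability is divided by its token count, so the margin inside the sigmoid shrinks by roughly a factor of $L$, and keeping the same effective loss scale means multiplying beta by about $L$ (hence the ~100x figure above).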
Thanks for your reply! I tried average_log_prob=False and the model seems more likely to generate long responses compared to the original. I wanted to avoid this problem and tried average_log_prob=True, but after training the model tends to generate repeated words. Have you met this problem, and do you know how to solve it? Thanks a lot!
I ran into the same issue a few months ago and didn't have any success with average_log_prob=True -- the model became very degenerate. Ultimately I left average_log_prob=False and had to add some extra tricks to keep DPO from teaching the model to write very long responses.
Hi all,
Could somebody please explain why average_log_prob=False makes the model generate longer responses? Any hints/clarifications are appreciated.
I've noticed that the model tends to generate longer responses as training progresses. I suspect that setting average_log_prob=True might slow down this process compared to when it's set to False. @longbowzhang , have you not encountered this issue when you've set it to True?
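One possible intuition (just a guess on my part, not confirmed anywhere in this thread): with average_log_prob=False, the quantity DPO pushes apart for the chosen and rejected responses is a sum of per-token log ratios,

$$
\beta \sum_{t} \log\frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})},
$$

so every extra token with a positive log ratio adds to the chosen/rejected margin, and emitting longer responses becomes an easy way to increase it, especially when the chosen examples are already longer on average. With average_log_prob=True the sum is divided by the number of tokens, which removes that additive length effect but also changes the loss scale (see the beta discussion above).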
@dblakely Could you please share the "extra tricks"?
Hey @yata0, the author mentioned some ideas here and I tried each of those 4 suggestions. All of them helped to some extent. To "normalize" the data lengths, I simply dropped a bunch of the longest positive examples from my dataset to bring the length distribution of positives and negatives closer together (a big part of the problem in my case was simply that the positive examples in my dataset were on average a fair amount longer than the negatives and DPO was over-optimizing that trait).
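For what it's worth, the length filtering itself was nothing fancy; roughly something like this (field names, tokenizer, and threshold are illustrative, not the repo's actual schema):

```python
def filter_length_outliers(dataset, tokenizer, max_ratio=1.5):
    """Drop preference pairs whose chosen response is much longer than
    the rejected one, so the two length distributions stay comparable.
    `dataset` is assumed to be a list of dicts with "chosen"/"rejected" texts.
    """
    filtered = []
    for ex in dataset:
        len_chosen = len(tokenizer(ex["chosen"])["input_ids"])
        len_rejected = len(tokenizer(ex["rejected"])["input_ids"])
        # Keep the pair only if the chosen response isn't a length outlier.
        if len_chosen <= max_ratio * len_rejected:
            filtered.append(ex)
    return filtered
```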
Thanks!