dvlab-research / Step-DPO

Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"
285 stars, 9 forks

question about StepDPOTrainer #18

Closed: FlyingDutchman26 closed this issue 2 months ago

FlyingDutchman26 commented 2 months ago

Hi, thanks for your work! I have a small question.

It seems that you implemented a StepDPOTrainer class that inherits from trl's DPOTrainer and defines a 'tokenize_row' function. However, as far as I can see, DPOTrainer does not have a 'tokenize_row' function; it belongs to 'OnlineDPOTrainer'. So I wonder whether StepDPOTrainer's 'tokenize_row' is really used in your training.

This may not affect your results, but I am curious about your 'tokenize_row' function: what does it do? Did you perhaps mean to use OnlineDPOTrainer?

Best wishes!

FlyingDutchman26 commented 2 months ago

Thanks, I understand. The problem was caused by a difference between trl versions.
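
For anyone hitting the same confusion, here is a minimal sketch of how one might check whether a subclass method such as 'tokenize_row' actually overrides something in its parent class. The helper names (`defined_in_base`, `check_installed_trl`) and the stand-in trainer classes are hypothetical and only for illustration; which trl versions define 'tokenize_row' on DPOTrainer is not stated in this thread, so the second helper simply inspects whatever version is installed.

```python
import importlib


def defined_in_base(cls: type, method_name: str) -> bool:
    """Return True if any ancestor of `cls` (other than `object`) defines `method_name`."""
    return any(
        method_name in vars(base)
        for base in cls.__mro__[1:]
        if base is not object
    )


# Hypothetical stand-ins for a parent trainer with and without the hook.
class ParentWithHook:
    def tokenize_row(self, feature):
        return feature


class ParentWithoutHook:
    pass


class StepTrainerA(ParentWithHook):
    def tokenize_row(self, feature):
        # Overrides the parent hook, so the parent's pipeline would call this.
        return {"step": feature}


class StepTrainerB(ParentWithoutHook):
    def tokenize_row(self, feature):
        # Nothing in the parent calls this; it is just an extra method.
        return {"step": feature}


def check_installed_trl(method_name: str = "tokenize_row"):
    """Report whether the installed trl's DPOTrainer exposes `method_name`.

    Returns None if trl (or its DPOTrainer) cannot be imported.
    """
    try:
        trl = importlib.import_module("trl")
        trainer_cls = getattr(trl, "DPOTrainer")
    except Exception:
        return None
    return hasattr(trainer_cls, method_name)


if __name__ == "__main__":
    print("A overrides a parent method:", defined_in_base(StepTrainerA, "tokenize_row"))
    print("B overrides a parent method:", defined_in_base(StepTrainerB, "tokenize_row"))
    print("Installed trl DPOTrainer has tokenize_row:", check_installed_trl())
```

If the last line prints False or None, the installed trl probably differs from the version the repository was written against, and the override would not be called by the parent class.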