THUDM / ReST-MCTS

ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
Utilization of negative samples #2

Open HillZhang1999 opened 2 months ago

HillZhang1999 commented 2 months ago

Dear authors: First of all, I appreciate your engaging and informative work! I have a question regarding your research: I noticed that you only utilize positive samples for SFT when enhancing the policy models. Have you considered incorporating negative samples through methods such as DPO?

zhangdan0602 commented 1 month ago

Thank you for your question! Indeed, we reproduce the Self-Rewarding baseline, which runs DPO using negative samples derived from the LLM's judgments.
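
For readers following the thread, here is a minimal sketch of how DPO uses a negative sample: the loss on a single (positive, negative) pair, written in plain Python. It is not code from the ReST-MCTS* repository or the Self-Rewarding baseline. The function name, argument names, and the default `beta` are assumptions made for illustration, and the inputs are assumed to be summed token log-probabilities of each response under the policy and a frozen reference model.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Illustrative DPO loss for one preference pair (hypothetical helper).

    "chosen" is the positive sample and "rejected" is the negative sample.
    Each argument is the log-probability of the full response.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The negative sample enters here: the loss falls as the policy
    # raises the chosen response relative to the rejected one.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# When the policy equals the reference, the loss is log(2).
baseline = dpo_pair_loss(-3.0, -3.0, -3.0, -3.0)

# A policy that prefers the positive sample more than the reference does
# gets a lower loss than the baseline.
improved = dpo_pair_loss(-1.0, -5.0, -2.0, -2.0)
```

In contrast, SFT on positive samples alone would only maximize the chosen log-probability and would never use the rejected response.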