RockeyCoss / SPO

Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step
https://arxiv.org/abs/2406.04314

About reward model dataset or reward model #3

Closed · jiashenggu closed this 3 months ago

jiashenggu commented 3 months ago

Hi, great work! I see you plan to release the training code. Will you also release the reward model dataset or the reward model itself?

pljj315 commented 3 months ago

+1

RockeyCoss commented 3 months ago

Thank you for your interest. We have released the fine-tuning code, the fine-tuning prompts, and the reward model checkpoints.

pljj315 commented 3 months ago

I just commented and then noticed it had already been released, wow~ ⊙o⊙