[Feature] Why does onlinedpo perform worse than DPO in the DPO scenario, and why does RLHF show a drop?

alibaba / ChatLearn

A flexible and efficient training framework for large-scale alignment tasks

Apache License 2.0

216 stars, 17 forks

[Feature] Why does onlinedpo perform worse than DPO in the DPO scenario, and why does RLHF show a drop? #159

Closed yiyepiaoling0715 closed 1 day ago

yiyepiaoling0715 commented 4 days ago

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

adoda commented 2 days ago
