Open Davood-M opened 1 week ago
What does this PR do?
Adds RPO over multiple responses for alignment. RPO can take a dataset with a variable number of responses per prompt, where each record has the form:
{ "prompt": ..., "responses": [ list of responses ], "rewards": [ list of rewards ] }
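To make the record format concrete, here is a minimal sketch of a loader for it. It is not taken from this PR: the `parse_record` helper, the JSON-lines storage, the one-reward-per-response check, and the sample data are all assumptions made for illustration.

```python
import json


def parse_record(line: str) -> dict:
    """Parse one JSON line into a prompt with paired responses and rewards.

    Hypothetical helper, not part of this PR.
    """
    record = json.loads(line)
    responses = record["responses"]
    rewards = [float(r) for r in record["rewards"]]
    # Assumption: every response carries exactly one reward.
    if len(responses) != len(rewards):
        raise ValueError(
            f"{len(responses)} responses but {len(rewards)} rewards"
        )
    # Assumption: a record with no responses is not usable.
    if not responses:
        raise ValueError("record has no responses")
    return {
        "prompt": record["prompt"],
        "responses": responses,
        "rewards": rewards,
    }


# Made-up sample data: the two prompts have different numbers of responses.
lines = [
    '{"prompt": "2+2?", "responses": ["4", "5", "four"], "rewards": [1.0, 0.0, 0.8]}',
    '{"prompt": "Capital of France?", "responses": ["Paris", "Lyon"], "rewards": [1.0, 0.1]}',
]
records = [parse_record(line) for line in lines]
for rec in records:
    print(rec["prompt"], len(rec["responses"]))
```

Because each record keeps its own lists, nothing has to be padded at load time. Any batching across prompts with different response counts would be handled downstream.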
Changelog
Usage
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
max_steps=-1 and validation
Additional Information