NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment

Apache License 2.0


RPO on multiple responses #311

Open Davood-M opened 1 week ago

Davood-M commented 1 week ago

What does this PR do?

Adds RPO on multiple responses for alignment. RPO can take a dataset with a variable number of responses per prompt.

Changelog

Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

Each record in the dataset should be formatted like this:

{
"prompt": ...,
"responses": [ list of responses ],
"rewards": [ list of rewards ]
}
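As a rough illustration of the format above, here is a minimal Python sketch that checks records before training. It is not part of this PR: the helper name `validate_record`, the one-record-per-line JSON layout, the sample values, and the requirement that `responses` and `rewards` have the same length (one reward per response) are all assumptions made for the example.

```python
import json


def validate_record(record):
    """Check one record and return its number of responses.

    Hypothetical helper, not part of NeMo-Aligner. Assumes one reward
    per response, which the format above suggests but does not state.
    """
    for key in ("prompt", "responses", "rewards"):
        if key not in record:
            raise ValueError(f"missing field: {key}")
    responses, rewards = record["responses"], record["rewards"]
    if not isinstance(responses, list) or not isinstance(rewards, list):
        raise ValueError("'responses' and 'rewards' must both be lists")
    if not responses:
        raise ValueError("'responses' must not be empty")
    if len(responses) != len(rewards):
        raise ValueError(
            f"{len(responses)} responses but {len(rewards)} rewards"
        )
    return len(responses)


# Illustrative records only; the number of responses varies per prompt.
sample_lines = [
    json.dumps({
        "prompt": "What is 2 + 2?",
        "responses": ["4", "5"],
        "rewards": [1.0, 0.0],
    }),
    json.dumps({
        "prompt": "Name a primary color.",
        "responses": ["red", "green", "purple"],
        "rewards": [1.0, 0.5, 0.0],
    }),
]

counts = [validate_record(json.loads(line)) for line in sample_lines]
print(counts)
```

A check like this catches mismatched list lengths early, before a malformed record reaches the trainer.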

Before your PR is "Ready for review"

Pre checks:

[x] Make sure you read and followed Contributor guidelines
[x] Did you write any new necessary tests?
[ ] Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide which contains the tutorials

Checklist when contributing a new algorithm

[x] Does the trainer resume and restore all model states?
[x] Does the trainer support all parallelism techniques (PP, TP, DP)?
[x] Does the trainer support max_steps=-1 and validation?
[ ] Does the trainer only call APIs defined in alignable_interface.py?
[x] Does the trainer have proper logging?

Additional Information

Related to # (issue)