NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment

Apache License 2.0


Self-Rewarding Algorithm with TRT Support #321

Status: Open. trias702 opened this PR 5 days ago.


What does this PR do?

Adds support for the Self-Rewarding and Meta-Rewarding algorithms from the following two papers:

Self-Rewarding: https://arxiv.org/abs/2401.10020
Meta-Rewarding: https://arxiv.org/abs/2407.19594
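For readers new to the first paper, its core data-generation step works like this: the policy generates several candidate responses per prompt and scores them itself, acting as an LLM-as-a-Judge. The highest- and lowest-scoring candidates then become the chosen/rejected pair for DPO training. The following is a minimal Python sketch under stated assumptions, not the code added in this PR. `score_fn` is a hypothetical stand-in for prompting the model with a judge template. `build_preference_pair`, `PreferencePair`, `toy_judge`, and the default of three judge samples are illustrative names and values only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Optional, Sequence


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],  # hypothetical model-as-judge scorer
    num_judge_samples: int = 3,  # illustrative default, an assumption
) -> Optional[PreferencePair]:
    """Score each candidate with the judge and pair the best against the worst."""
    if len(candidates) < 2:
        return None
    scores: List[float] = []
    for response in candidates:
        # Average several judge calls to reduce noise in the self-assigned score.
        scores.append(mean(score_fn(prompt, response) for _ in range(num_judge_samples)))
    best = max(range(len(candidates)), key=scores.__getitem__)
    worst = min(range(len(candidates)), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        # All candidates tied: there is no preference signal, so skip this prompt.
        return None
    return PreferencePair(prompt, candidates[best], candidates[worst])


def toy_judge(prompt: str, response: str) -> float:
    # Stand-in judge for demonstration: one point per word, capped at 5.
    return min(5.0, float(len(response.split())))


pair = build_preference_pair(
    "Explain DPO.",
    ["Short.", "A longer, more detailed answer.", "Medium answer here."],
    toy_judge,
)
```

Dropping tied prompts keeps the DPO dataset free of pairs that carry no preference signal. The Meta-Rewarding paper extends this loop by also having the model judge its own judgments.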

Changelog

Please update CHANGELOG.md under the next version with the high-level changes in this PR.

Usage

Please see the new tutorial document: docs/user-guide/self_rewarding.rst

Before your PR is "Ready for review"

Pre checks:

[X] Make sure you read and followed the Contributor guidelines
[ ] Did you write any new necessary tests?
[X] Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide, which contains the tutorials

Checklist when contributing a new algorithm

[X] Does the trainer resume and restore all model states?
[X] Does the trainer support all parallelism techniques (PP, TP, DP)?
[X] Does the trainer support max_steps=-1 and validation?
[X] Does the trainer only call APIs defined in alignable_interface.py?
[X] Does the trainer have proper logging?

Additional Information

Related to # (issue)