Adds a GRPO (Group Relative Policy Optimization) implementation for LLM reinforcement learning. GRPO delivers PPO-level performance gains on mathematical reasoning while using significantly less memory, since it needs no critic model, and it can achieve substantial accuracy improvements using only existing instruction-tuning data.
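For reviewers unfamiliar with why no critic is needed: GRPO scores each sampled completion relative to the other completions drawn for the same prompt, so the group itself acts as the baseline. The snippet below is a minimal sketch of that normalization step only, not the code in this PR. The function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the group sampled for the same prompt.

    Hypothetical helper for illustration. The group mean replaces the
    learned value baseline that PPO would get from a critic model.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption; some implementations use sample std
    # eps guards against division by zero when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four sampled completions of one prompt
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average get a positive advantage and the rest get a negative one. When all completions in a group receive the same reward, every advantage is zero and that group contributes no learning signal.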
Type of change
Select all that apply:
[ ] Bug fix (non-breaking change that addresses a specific issue)
[x] New feature (non-breaking change that adds functionality)
[ ] Breaking change (a change that could affect existing functionality)
Changes
List the key changes introduced in this PR:
Checklist
Make sure the following tasks are completed before submitting the PR:
General:
Dependencies and Configuration:
Testing:
Performance Impact:
Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.