Configs now have room for custom sample prompts rather than only sampling randomly from the pipeline. You can now track how the model's generations for these fixed prompts change over training.
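A minimal sketch of what such a config could look like, assuming a dataclass-based config system; the field names (`fixed_sample_prompts`, `num_random_samples`) are hypothetical, not the actual config keys:

```python
from dataclasses import dataclass, field

@dataclass
class SamplingConfig:
    # Hypothetical fields: prompts listed here are generated from at
    # every eval step, so outputs can be compared across checkpoints;
    # random pipeline sampling still covers the rest.
    fixed_sample_prompts: list[str] = field(default_factory=list)
    num_random_samples: int = 8

cfg = SamplingConfig(fixed_sample_prompts=["Explain PPO clipping in one sentence."])
```

Keeping the prompts fixed is what makes the logged generations comparable step-to-step.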
More debug metrics, such as the KL divergence between the model before and after a single RL training step, and the fraction of updates clipped by the PPO clip on the policy ratio.
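For illustration, a pure-Python sketch of how these two diagnostics are commonly estimated from per-token log-probabilities; this is a generic formulation (the simple `mean(logp_old - logp_new)` KL estimator and the standard `|ratio - 1| > eps` clip test), not necessarily the exact implementation here:

```python
import math

def approx_kl(logprobs_old: list[float], logprobs_new: list[float]) -> float:
    # Simple sample-based KL(old || new) estimate over token log-probs.
    return sum(o - n for o, n in zip(logprobs_old, logprobs_new)) / len(logprobs_old)

def clip_fraction(logprobs_old: list[float], logprobs_new: list[float],
                  clip_eps: float = 0.2) -> float:
    # Fraction of tokens whose policy ratio exp(new - old) left the
    # PPO trust region [1 - eps, 1 + eps].
    ratios = (math.exp(n - o) for o, n in zip(logprobs_old, logprobs_new))
    flags = [abs(r - 1.0) > clip_eps for r in ratios]
    return sum(flags) / len(flags)
```

A rising KL or clip fraction is a cheap early warning that a single update moved the policy too far.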
The loss was not being reduced across GPUs before; each rank logged only its local value. Now it is reduced globally.
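To see why the unreduced value is misleading, here is a pure-Python sketch of the reduction: when ranks process different numbers of tokens, the correct global loss is a count-weighted average of the per-rank means, not any single rank's mean (in practice this is done with a collective op like an all-reduce; the function below is illustrative only):

```python
def reduce_loss(per_rank_losses: list[float], per_rank_counts: list[int]) -> float:
    # Count-weighted average: each rank's mean loss is scaled back up by
    # its token count, summed, and renormalized by the global count.
    total = sum(loss * count for loss, count in zip(per_rank_losses, per_rank_counts))
    return total / sum(per_rank_counts)
```

With equal counts this matches the plain mean of rank losses, but with uneven shards the naive mean over ranks drifts from the true global loss.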