Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as a human-centric (helpful, honest, and harmless) assistant. Alignment with humans assumes paramount significance, and reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit. Current technical routes usually include \textbf{reward models} to measure human preferences, \textbf{Proximal Policy Optimization} (PPO) to optimize policy model outputs, and \textbf{process supervision} to improve step-by-step reasoning capabilities. However, due to the challenges of reward design, environment interaction, and agent training, coupled with the huge trial-and-error cost of large language models, there is a significant barrier for AI researchers to advance technical alignment and the safe landing of LLMs. The stable training of RLHF remains a puzzle. In this first report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising the PPO algorithm impact policy agent training. We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. Therefore, we explore PPO-max, an advanced version of the PPO algorithm, to efficiently improve the training stability of the policy model. Based on our main results, we perform a comprehensive analysis of the abilities of RLHF models compared with SFT models and ChatGPT. The absence of open-source implementations has posed significant challenges to the investigation of LLM alignment. Therefore, we are eager to release technical reports, reward models, and PPO codes.
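To make the "policy constraint" idea concrete, the following is a minimal sketch of a PPO-clip policy loss augmented with a KL penalty that keeps the policy close to a frozen reference (SFT) model. It is an illustrative assumption of how such a constraint is typically realized, not the paper's PPO-max implementation; the function name, default coefficients (`clip_eps`, `kl_coef`), and toy tensors are all hypothetical.

```python
import torch

def ppo_loss_with_kl_penalty(
    logprobs_new,   # log pi_theta(a_t | s_t) under the current policy
    logprobs_old,   # log pi_theta_old(a_t | s_t) recorded at rollout time
    logprobs_ref,   # log pi_ref(a_t | s_t) under the frozen SFT/reference model
    advantages,     # advantage estimates A_t (e.g., from GAE)
    clip_eps=0.2,   # PPO clipping range (illustrative default)
    kl_coef=0.05,   # weight of the KL policy constraint (illustrative default)
):
    """Clipped PPO surrogate loss plus a soft KL constraint toward the
    reference policy -- a sketch of one common way to constrain the policy."""
    # Probability ratio r_t = pi_theta / pi_theta_old
    ratio = torch.exp(logprobs_new - logprobs_old)

    # Standard clipped surrogate objective (minimize the negative)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Per-token KL estimate against the reference policy, acting as a
    # soft constraint that discourages drifting far from the SFT model
    kl_penalty = (logprobs_new - logprobs_ref).mean()

    return policy_loss + kl_coef * kl_penalty


# Toy usage with random tensors standing in for one batch of rollout tokens
if __name__ == "__main__":
    n = 8
    logprobs_old = torch.randn(n)
    logprobs_new = logprobs_old + 0.1 * torch.randn(n)
    logprobs_ref = logprobs_old.clone()
    advantages = torch.randn(n)
    print(ppo_loss_with_kl_penalty(logprobs_new, logprobs_old, logprobs_ref, advantages))
```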