Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as a human-centric (helpful, honest, and harmless) assistant. Alignment with humans assumes paramount significance, and reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit. Current technical routes usually include \textbf{reward models} to measure human preferences, \textbf{Proximal Policy Optimization} (PPO) to optimize policy model outputs, and \textbf{process supervision} to improve step-by-step reasoning capabilities. However, due to the challenges of reward design, environment interaction, and agent training, coupled with the huge trial-and-error cost of large language models, there is a significant barrier for AI researchers to advance technical alignment and the safe landing of LLMs. The stable training of RLHF remains a puzzle. In this first report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising the PPO algorithm impact policy agent training. We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. Therefore, we explore PPO-max, an advanced version of the PPO algorithm, to efficiently improve the training stability of the policy model. Based on our main results, we perform a comprehensive analysis of the abilities of RLHF models compared with SFT models and ChatGPT. The absence of open-source implementations has posed significant challenges to the investigation of LLM alignment. Therefore, we are eager to release technical reports, reward models, and PPO codes.
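To make the "policy constraint" idea concrete, the following is a minimal sketch of a PPO-clip policy loss augmented with a KL penalty that keeps the policy close to a frozen reference (SFT) model. It is an illustrative assumption of how such a constraint is typically realized, not the paper's PPO-max implementation; the function name, default coefficients (`clip_eps`, `kl_coef`), and toy tensors are all hypothetical.

```python
import torch

def ppo_loss_with_kl_penalty(
    logprobs_new,   # log pi_theta(a_t | s_t) under the current policy
    logprobs_old,   # log pi_theta_old(a_t | s_t) recorded at rollout time
    logprobs_ref,   # log pi_ref(a_t | s_t) under the frozen SFT/reference model
    advantages,     # advantage estimates A_t (e.g., from GAE)
    clip_eps=0.2,   # PPO clipping range (illustrative default)
    kl_coef=0.05,   # weight of the KL policy constraint (illustrative default)
):
    """Clipped PPO surrogate loss plus a soft KL constraint toward the
    reference policy -- a sketch of one common way to constrain the policy."""
    # Probability ratio r_t = pi_theta / pi_theta_old
    ratio = torch.exp(logprobs_new - logprobs_old)

    # Standard clipped surrogate objective (minimize the negative)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Per-token KL estimate against the reference policy, acting as a
    # soft constraint that discourages drifting far from the SFT model
    kl_penalty = (logprobs_new - logprobs_ref).mean()

    return policy_loss + kl_coef * kl_penalty


# Toy usage with random tensors standing in for one batch of rollout tokens
if __name__ == "__main__":
    n = 8
    logprobs_old = torch.randn(n)
    logprobs_new = logprobs_old + 0.1 * torch.randn(n)
    logprobs_ref = logprobs_old.clone()
    advantages = torch.randn(n)
    print(ppo_loss_with_kl_penalty(logprobs_new, logprobs_old, logprobs_ref, advantages))
```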