AkihikoWatanabe commented 9 months ago

URL

https://arxiv.org/abs/2312.16682
Affiliations
- Jing Xu, N/A
- Andrew Lee, N/A
- Sainbayar Sukhbaatar, N/A
- Jason Weston, N/A
Abstract
- Practitioners commonly align large language models using pairwise preferences, i.e., given labels of the type response A is preferred to response B for a given input. Perhaps less commonly, methods have also been developed for binary feedback, i.e., training models given labels of type response A is good or bad. We show how an existing performant binary feedback method, the Cringe Loss (Adolphs et al., 2022), can be generalized to the pairwise preference setting using a simple soft margin extension. Pairwise Cringe Loss is straightforward to implement and efficient to train, and we find it outperforms state-of-the-art preference optimization algorithms such as PPO and DPO on the AlpacaFarm benchmark.
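Translation (by gpt-3.5-turbo)
It is common to align large language models using pairwise preferences, that is, labels of the type "response A is preferred to response B" for a given input. Perhaps less commonly, methods for binary feedback have also been developed, that is, training the model with labels of the type "response A is good or bad". We show that the Cringe Loss (Adolphs et al., 2022), an existing high-performing binary feedback method, can be generalized to the pairwise preference setting using a simple soft margin extension. Pairwise Cringe Loss is easy to implement and efficient to train, and we find that it outperforms state-of-the-art preference optimization algorithms such as PPO and DPO on the AlpacaFarm benchmark.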
Summary (by gpt-3.5-turbo)
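Alignment by pairwise preferences is commonly used when training language models, but methods based on binary feedback also exist. This work extends an existing binary feedback method to the pairwise preference setting and shows that it performs well. The method is simple to implement and efficient, and it outperforms state-of-the-art preference optimization algorithms.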
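
As a rough illustration of what a "soft margin extension" of a binary-feedback loss could look like, the sketch below scales a per-pair binary loss by a sigmoid gate on the log-probability gap between the preferred and the rejected response. The function names, the `margin` and `temperature` parameters, and the exact form of the gate are assumptions made for illustration; this note does not state the formula, so the paper should be checked for the actual definition. The binary loss itself (the Cringe Loss) is not implemented here and is passed in as a plain number.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_margin_gate(logp_chosen: float, logp_rejected: float,
                     margin: float = 1.0, temperature: float = 1.0) -> float:
    """Hypothetical gate: close to 1 while the preferred response does not yet
    beat the rejected one by `margin`, close to 0 once it clearly does."""
    gap = logp_chosen - logp_rejected
    return sigmoid((margin - gap) / temperature)


def gated_pairwise_loss(logp_chosen: float, logp_rejected: float,
                        binary_loss: float, margin: float = 1.0,
                        temperature: float = 1.0) -> float:
    """Scale a binary-feedback loss (assumed to be computed elsewhere with the
    preferred response as 'good' and the rejected one as 'bad') by the gate."""
    gate = soft_margin_gate(logp_chosen, logp_rejected, margin, temperature)
    return gate * binary_loss


if __name__ == "__main__":
    # Pair already well separated: the loss is almost switched off.
    print(gated_pairwise_loss(-1.0, -10.0, binary_loss=4.0, margin=2.0))
    # Pair not separated at all: most of the loss is kept.
    print(gated_pairwise_loss(-5.0, -5.0, binary_loss=4.0, margin=2.0))
```

Under this reading, pairs the model already ranks correctly by a comfortable margin contribute little gradient, while pairs that are tied or ranked the wrong way are trained on almost as in the binary setting.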

AkihikoWatanabe commented 9 months ago

A new alignment method that outperforms DPO and PPO. From Jason Weston of Meta.

Original tweet: https://x.com/jaseweston/status/1740546297235464446?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q

AkihikoWatanabe commented 9 months ago

To read later.

(Image quoted from the original tweet.)

Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss, Jing Xu+, N/A, arXiv'23 #1201