AkihikoWatanabe commented 2 weeks ago

URL

http://arxiv.org/abs/2411.04109
Authors
- Archiki Prasad
- Weizhe Yuan
- Richard Yuanzhe Pang
- Jing Xu
- Maryam Fazel-Zarandi
- Mohit Bansal
- Sainbayar Sukhbaatar
- Jason Weston
- Jane Yu
Abstract
- Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
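The inference-time self-consistency step the abstract refers to can be sketched as a majority vote over the final answers of several sampled solutions. This is a minimal illustration, not the authors' code: the function name `most_consistent_answer` and the sample answers are hypothetical, and it assumes a final answer has already been extracted from each sampled solution.

```python
from collections import Counter


def most_consistent_answer(answers):
    """Return (answer, vote_share) for the most frequent final answer.

    `answers` holds one extracted final answer per sampled solution.
    Ties go to the answer that was sampled first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers from 5 sampled solutions to one problem.
sampled = ["42", "42", "41", "42", "17"]
best, share = most_consistent_answer(sampled)
print(best, share)
```

The vote share gives a rough confidence signal without any gold label, which is what makes the idea usable on unlabeled problems.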
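Translation (by gpt-4o-mini)
Self-alignment, in which models improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks because assigning correct rewards is difficult. Another approach known to improve correctness is self-consistency, a method applied at inference time that draws multiple samples to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. Specifically, we introduce self-consistency preference optimization (ScPO), which trains iteratively on new unsupervised problems so that consistent answers are preferred over inconsistent ones. We show that ScPO yields large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, narrowing the gap with supervised training that uses gold answers or preferences. We also show that combining ScPO with standard supervised learning improves results further. On ZebraLogic, finetuning Llama-3 8B with ScPO achieves performance exceeding Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.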
Summary (by gpt-4o-mini)
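Self-alignment is a way for models to improve themselves without human annotation. The paper proposes self-consistency preference optimization (ScPO), a new approach that uses self-consistency for training. ScPO prefers consistent answers over inconsistent ones and substantially outperforms conventional methods on reasoning tasks such as GSM8K and MATH; combining it with standard supervised learning improves results further. On ZebraLogic, Llama-3 8B finetuned with ScPO surpasses other, larger models.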
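As a rough sketch of how "preferring consistent answers" could become training data, the snippet below builds one preference pair per unlabeled problem from vote counts. The pairing rule (most-voted answer as chosen, least-voted as rejected) and the `min_margin` filter are assumptions made for illustration; the notes above do not spell out the selection rule, and `build_preference_pair` is a hypothetical helper.

```python
from collections import Counter


def build_preference_pair(responses, min_margin=0.5):
    """Build a (chosen, rejected) pair from sampled responses to one problem.

    `responses` is a list of (solution_text, final_answer) tuples.
    Assumed rule: chosen = a solution with the most-voted answer,
    rejected = a solution with the least-voted answer. Returns None when
    all samples agree or the vote margin falls below `min_margin`.
    """
    if not responses:
        return None
    votes = Counter(answer for _, answer in responses)
    ranked = votes.most_common()  # sorted by vote count, descending
    (top_answer, top_votes), (low_answer, low_votes) = ranked[0], ranked[-1]
    margin = (top_votes - low_votes) / len(responses)
    if top_answer == low_answer or margin < min_margin:
        return None
    chosen = next(text for text, ans in responses if ans == top_answer)
    rejected = next(text for text, ans in responses if ans == low_answer)
    return {"chosen": chosen, "rejected": rejected, "margin": margin}


# Hypothetical samples: three solutions agree on "8", one says "7".
samples = [("sol A", "8"), ("sol B", "8"), ("sol C", "8"), ("sol D", "7")]
pair = build_preference_pair(samples)
print(pair)
```

Pairs like this could then be fed to a preference-optimization objective, with the margin serving as a confidence signal; problems where the samples are evenly split are skipped because the vote gives no clear preference.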

AkihikoWatanabe commented 2 weeks ago

Original post: https://x.com/jaseweston/status/1854532624116547710?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q

Self-Consistency Preference Optimization, Archiki Prasad+, arXiv'24 #1489
