URL

https://arxiv.org/abs/2409.12183
Affiliations
- Zayne Sprague, N/A
- Fangcong Yin, N/A
- Juan Diego Rodriguez, N/A
- Dongwei Jiang, N/A
- Manya Wadhwa, N/A
- Prasann Singhal, N/A
- Xinyu Zhao, N/A
- Xi Ye, N/A
- Kyle Mahowald, N/A
- Greg Durrett, N/A
Abstract
- Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
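
The equals-sign observation on MMLU can be illustrated with a small grouping sketch. This is not the authors' analysis code: the `Record` fields and the function names `has_equals` and `cot_gain_by_group` are hypothetical, and the sketch only assumes that per-question correctness for both direct answering and CoT is already available. It splits questions by whether an equals sign appears in the question or the CoT response, then reports the accuracy difference (CoT minus direct) within each group.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    """One evaluated question (hypothetical schema, not from the paper)."""
    question: str
    cot_response: str
    direct_correct: bool  # was the direct (no-CoT) answer correct?
    cot_correct: bool     # was the CoT answer correct?


def has_equals(record: Record) -> bool:
    """True if the question or the CoT response contains an equals sign."""
    return "=" in record.question or "=" in record.cot_response


def cot_gain_by_group(records: List[Record]) -> Dict[bool, Optional[float]]:
    """Return CoT accuracy minus direct accuracy for each group.

    Key True is the group with an equals sign, key False the group without.
    A group with no records maps to None.
    """
    groups: Dict[bool, List[Record]] = {True: [], False: []}
    for record in records:
        groups[has_equals(record)].append(record)

    gains: Dict[bool, Optional[float]] = {}
    for flag, group in groups.items():
        if not group:
            gains[flag] = None
            continue
        direct_acc = sum(r.direct_correct for r in group) / len(group)
        cot_acc = sum(r.cot_correct for r in group) / len(group)
        gains[flag] = cot_acc - direct_acc
    return gains
```

If the abstract's finding holds for a given model, the gain for the `False` group should be close to zero while the `True` group carries most of the improvement. Since the CoT response is only known after generation, a router that decides up front whether to use CoT could check the question alone, which is a further assumption beyond what the abstract states.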
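Translation (by gpt-4o-mini)
Prompting with chain-of-thought (CoT) is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for which kinds of tasks is this additional "thinking" actually useful? To analyze this, we carried out a quantitative meta-analysis of more than 100 papers that use CoT and ran our own evaluations on 20 datasets across 14 models. The results show that CoT yields strong performance gains mainly on tasks involving math or logic, and much smaller benefits on other types of tasks. On MMLU, generating the answer directly without CoT gives nearly the same accuracy as CoT unless the question or the model's response contains an equals sign, which indicates symbolic operations and reasoning. Building on this finding, we analyzed the behavior of CoT on these problems by separating planning from execution and comparing against tool-augmented LLMs. Much of CoT's benefit comes from improved symbolic execution, but CoT still falls short of using a symbolic solver. Our results show that applying CoT selectively can maintain performance while saving inference cost. They also suggest a need to move beyond prompt-based CoT toward new paradigms that make better use of intermediate computation across the full range of LLM applications.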
Summary (by gpt-4o-mini)
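Chain-of-thought (CoT) prompting is a technique for eliciting reasoning from LLMs. A meta-analysis of more than 100 papers confirmed performance gains mainly on math and logic tasks. On other tasks the effect was limited, and on MMLU direct answer generation reached accuracy comparable to CoT. Separating planning from execution and comparing against tool-augmented LLMs showed that CoT's advantage stems from improved symbolic execution and that CoT falls short of a symbolic solver. The results suggest that applying CoT selectively may save inference cost while maintaining performance, and they call for better use of intermediate computation across LLM applications.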

AkihikoWatanabe / paper_notes

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning, Zayne Sprague+, N/A, arXiv'24 #1406
