URL

https://www.arxiv.org/abs/2409.19924
Affiliations
- Kevin Wang, N/A
- Junbo Li, N/A
- Neel P. Bhatt, N/A
- Yihan Xi, N/A
- Qiang Liu, N/A
- Ufuk Topcu, N/A
- Zhangyang Wang, N/A
Abstract
- Recent advancements in Large Language Models (LLMs) have showcased their ability to perform complex reasoning tasks, but their effectiveness in planning remains underexplored. In this study, we evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks, focusing on three key aspects: feasibility, optimality, and generalizability. Through empirical evaluations on constraint-heavy tasks (e.g., $\textit{Barman}$, $\textit{Tyreworld}$) and spatially complex environments (e.g., $\textit{Termes}$, $\textit{Floortile}$), we highlight o1-preview's strengths in self-evaluation and constraint-following, while also identifying bottlenecks in decision-making and memory management, particularly in tasks requiring robust spatial reasoning. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints and managing state transitions in structured environments. However, the model often generates suboptimal solutions with redundant actions and struggles to generalize effectively in spatially complex tasks. This pilot study provides foundational insights into the planning limitations of LLMs, offering key directions for future research on improving memory management, decision-making, and generalization in LLM-based planning. Code available at https://github.com/VITA-Group/o1-planning.
Translation (by gpt-4o-mini)
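Recent advances in large language models (LLMs) have demonstrated their ability to perform complex reasoning tasks, but their effectiveness in planning has not yet been sufficiently explored. In this study, we evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks, focusing on three key aspects: feasibility, optimality, and generalizability. Through empirical evaluation on constraint-heavy tasks (e.g., $\textit{Barman}$, $\textit{Tyreworld}$) and spatially complex environments (e.g., $\textit{Termes}$, $\textit{Floortile}$), we highlight o1-preview's strengths in self-evaluation and constraint-following, while identifying bottlenecks in decision-making and memory management, particularly in tasks that require robust spatial reasoning. Our results show that o1-preview outperforms GPT-4 at adhering to task constraints and managing state transitions in structured environments. However, the model often generates suboptimal solutions with redundant actions and struggles to generalize effectively in spatially complex tasks. This pilot study provides foundational insights into the limitations of LLMs in planning and indicates key directions for future research on improving memory management, decision-making, and generalization in LLM-based planning. Code is available at https://github.com/VITA-Group/o1-planning.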
Summary (by gpt-4o-mini)
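This study evaluates the planning capabilities of OpenAI's o1 models, focusing on three aspects: feasibility, optimality, and generalizability. In particular, it identifies strengths and bottlenecks in constraint-heavy tasks and spatially complex environments. While o1-preview outperforms GPT-4 at constraint adherence in structured environments, it generates suboptimal solutions with redundant actions and struggles to generalize. The study clarifies the limitations of LLMs in planning and indicates directions for future improvement.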
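
To make the feasibility and optimality criteria concrete, here is a minimal sketch of how a generated plan could be scored on both. Everything in it is an assumption for illustration: the toy "locked cell" domain, the `step`, `is_feasible`, and `optimal_length` helpers, and the example plans are hypothetical and are not taken from the paper or its benchmarks.

```python
from collections import deque

# Hypothetical toy domain (not from the paper): an agent walks along cells 0..3
# and must pick up a key at cell 1 before it may enter cell 3.
ACTIONS = ("left", "right", "pickup")
START, GOAL = (0, False), (3, True)  # state = (position, has_key)


def step(state, action):
    """Return the next state, or None if the action violates a constraint."""
    pos, has_key = state
    if action == "right" and pos < 3:
        if pos + 1 == 3 and not has_key:
            return None  # constraint: cell 3 is locked without the key
        return (pos + 1, has_key)
    if action == "left" and pos > 0:
        return (pos - 1, has_key)
    if action == "pickup" and pos == 1 and not has_key:
        return (pos, True)
    return None


def is_feasible(plan):
    """Feasibility: every action is legal and the plan ends in the goal state."""
    state = START
    for action in plan:
        state = step(state, action)
        if state is None:
            return False
    return state == GOAL


def optimal_length():
    """Breadth-first search for the shortest feasible plan length."""
    frontier = deque([(START, 0)])
    seen = {START}
    while frontier:
        state, depth = frontier.popleft()
        if state == GOAL:
            return depth
        for action in ACTIONS:
            nxt = step(state, action)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None


# A feasible plan that contains a redundant left/right detour.
redundant_plan = ["right", "pickup", "left", "right", "right", "right"]
# A plan that breaks the key constraint and is therefore infeasible.
invalid_plan = ["right", "right", "right"]

redundant_steps = len(redundant_plan) - optimal_length()
```

In this sketch, `redundant_plan` passes the feasibility check but is longer than the breadth-first optimum, which mirrors the "feasible but suboptimal" behavior the abstract describes. Generalizability would need held-out problem instances and is not covered here.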

On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability, Kevin Wang+, N/A, arXiv'24, 2024.11 #1477
