Scientific experimentation involves an iterative process of creating hypotheses, designing experiments, running experiments, and analyzing the results. Can we build AI research agents to perform these long-horizon tasks? To take a step towards building and evaluating research agents on such open-ended decision-making tasks, we focus on the problem of machine learning engineering: given a task description and a dataset, build a high-performing model. In this paper, we propose MLAgentBench, a suite of ML tasks for benchmarking AI research agents. Agents can perform actions like reading/writing files, executing code, and inspecting outputs. With these actions, agents could run experiments, analyze the results, and modify the code of entire machine learning pipelines, such as data processing, architecture, training processes, etc. The benchmark then automatically and objectively evaluates the agent over various metrics related to performance and efficiency. We also design an LLM-based research agent to automatically perform experimentation loops in such an environment. Empirically, we find that a GPT-4-based research agent can feasibly build compelling ML models over many tasks in MLAgentBench, displaying highly interpretable plans and actions. However, the success rates vary considerably; they span from almost 90% on well-established older datasets to as low as 10% on recent Kaggle Challenges -- unavailable during the LLM model's pretraining -- and even 0% on newer research challenges like BabyLM. Finally, we identify several key challenges for LLM-based research agents such as long-term planning and hallucination. Our code is released at https://github.com/snap-stanford/MLAgentBench.
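To make the described action space concrete, the sketch below shows how a file-based agent-environment loop of this kind might look. It is a minimal illustration under assumed interfaces, not the actual MLAgentBench API: the helper names (read_file, write_file, execute_script, research_loop) and the agent.decide interface are hypothetical and chosen only for this example.

```python
import subprocess
from pathlib import Path

# Hypothetical action implementations; these names do not correspond to the
# real MLAgentBench codebase and are for illustration only.

def read_file(path: str) -> str:
    """Return the contents of a file in the task workspace."""
    return Path(path).read_text()

def write_file(path: str, content: str) -> None:
    """Overwrite a file in the task workspace with new content."""
    Path(path).write_text(content)

def execute_script(path: str, timeout: int = 600) -> str:
    """Run a Python script and capture its output (e.g. training logs, metrics)."""
    result = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def research_loop(agent, max_steps: int = 30) -> None:
    """Let the agent iteratively act on the workspace and observe the results."""
    observation = "Task description and starter files are in the workspace."
    for _ in range(max_steps):
        # The agent (e.g. an LLM prompted with the interaction history) picks
        # the next action and its arguments.
        action, args = agent.decide(observation)
        if action == "read":
            observation = read_file(args["path"])
        elif action == "write":
            write_file(args["path"], args["content"])
            observation = f"Wrote {args['path']}"
        elif action == "execute":
            observation = execute_script(args["path"])
        elif action == "finish":
            break
```

In the benchmark itself, each task additionally supplies a dataset and an automatic evaluation of the final model, which this sketch omits.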