Scientific experimentation involves an iterative process of creating hypotheses, designing experiments, running experiments, and analyzing the results. Can we build AI research agents to perform these long-horizon tasks? To take a step towards building and evaluating research agents on such open-ended decision-making tasks, we focus on the problem of machine learning engineering: given a task description and a dataset, build a high-performing model. In this paper, we propose MLAgentBench, a suite of ML tasks for benchmarking AI research agents. Agents can perform actions like reading/writing files, executing code, and inspecting outputs. With these actions, agents could run experiments, analyze the results, and modify the code of entire machine learning pipelines, such as data processing, architecture, training processes, etc. The benchmark then automatically and objectively evaluates the agent over various metrics related to performance and efficiency. We also design an LLM-based research agent to automatically perform experimentation loops in such an environment. Empirically, we find that a GPT-4-based research agent can feasibly build compelling ML models over many tasks in MLAgentBench, displaying highly interpretable plans and actions. However, the success rates vary considerably; they span from almost 90% on well-established older datasets to as low as 10% on recent Kaggle Challenges -- unavailable during the LLM model's pretraining -- and even 0% on newer research challenges like BabyLM. Finally, we identify several key challenges for LLM-based research agents such as long-term planning and hallucination. Our code is released at https://github.com/snap-stanford/MLAgentBench.
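To make the described action space concrete, the sketch below shows how a file-based agent-environment loop of this kind might look. It is a minimal illustration under assumed interfaces, not the actual MLAgentBench API: the helper names (read_file, write_file, execute_script, research_loop) and the agent.decide interface are hypothetical and chosen only for this example.

```python
import subprocess
from pathlib import Path

# Hypothetical action implementations; these names do not correspond to the
# real MLAgentBench codebase and are for illustration only.

def read_file(path: str) -> str:
    """Return the contents of a file in the task workspace."""
    return Path(path).read_text()

def write_file(path: str, content: str) -> None:
    """Overwrite a file in the task workspace with new content."""
    Path(path).write_text(content)

def execute_script(path: str, timeout: int = 600) -> str:
    """Run a Python script and capture its output (e.g. training logs, metrics)."""
    result = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def research_loop(agent, max_steps: int = 30) -> None:
    """Let the agent iteratively act on the workspace and observe the results."""
    observation = "Task description and starter files are in the workspace."
    for _ in range(max_steps):
        # The agent (e.g. an LLM prompted with the interaction history) picks
        # the next action and its arguments.
        action, args = agent.decide(observation)
        if action == "read":
            observation = read_file(args["path"])
        elif action == "write":
            write_file(args["path"], args["content"])
            observation = f"Wrote {args['path']}"
        elif action == "execute":
            observation = execute_script(args["path"])
        elif action == "finish":
            break
```

In the benchmark itself, each task additionally supplies a dataset and an automatic evaluation of the final model, which this sketch omits.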