computer-agents / agent-studio

Environments, tools, and benchmarks for general computer agents
https://computer-agents.github.io/agent-studio/
GNU Affero General Public License v3.0

AgentStudio

Python 3.11 | Code style: black | License: AGPL v3 | pre-commit

AgentStudio is a trinity of environments, tools, and benchmarks for general virtual agents to interact with any computer software. AgentStudio targets the desiderata for robust, general, and open-ended virtual agents by providing:

  1. A lightweight, interactive environment with highly generic observation and action spaces, e.g., video observations and GUI/API actions
  2. Tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos
  3. Online benchmark tasks that evaluate both GUI interactions and function calling with auto-evaluation and language feedback
  4. Three benchmark datasets: GroundUI, IDMBench, and CriticBench, for fundamental agent abilities, including GUI grounding, learning from videos, and success detection

Comparisons with existing work:

News

Installation

Install requirements:

apt-get install gnome-screenshot xclip xdotool  # If using Ubuntu 22.04
conda create --name agent-studio python=3.11 -y
conda activate agent-studio
pip install -e '.[client]'
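
Before running the environment, it can help to confirm that the system packages from the apt-get line are actually on PATH. The following is a minimal sketch and not part of AgentStudio: the helper name `missing_tools` is hypothetical, and it assumes each package installs an executable with the same name as the package.

```python
import shutil

# System tools named in the apt-get line above (Ubuntu 22.04).
REQUIRED_TOOLS = ("gnome-screenshot", "xclip", "xdotool")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All required system tools were found.")
```

An empty result from `missing_tools()` only shows that the executables are discoverable; it does not verify their versions or that a display server is available.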

Store all confidential API keys (e.g., OpenAI, Claude, and Gemini API keys) in agent_studio/config/api_key.json. An example config is provided in agent_studio/config/api_key_template.json.
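
One way to read the key file from Python is sketched below. This is an illustration rather than AgentStudio's own loader: the function name `load_api_keys` is hypothetical, and the sketch assumes the file is a flat JSON object mapping key names to strings. Check agent_studio/config/api_key_template.json for the actual field names and structure.

```python
import json
from pathlib import Path

# Path taken from the README; resolved relative to the repository root.
DEFAULT_KEY_FILE = Path("agent_studio/config/api_key.json")


def load_api_keys(path=DEFAULT_KEY_FILE, required=()):
    """Load API keys from a JSON file and check that required entries are set.

    Assumes a flat JSON object such as {"SOME_KEY": "value"}; the real
    template may use different field names or a different structure.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; copy api_key_template.json and fill it in."
        )
    with path.open(encoding="utf-8") as f:
        keys = json.load(f)
    if not isinstance(keys, dict):
        raise ValueError(f"{path} must contain a JSON object.")
    # Treat absent or empty values as missing.
    missing = [name for name in required if not keys.get(name)]
    if missing:
        raise KeyError(f"Missing API keys in {path}: {', '.join(missing)}")
    return keys
```

Passing `required=(...)` with the names your agent needs makes a missing or empty key fail early with a clear message instead of surfacing later as an authentication error.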

AgentStudio Overall Benchmark Tasks

AgentStudio provides highly generic observation and action spaces, which significantly expand the task space and allow agents to be developed and evaluated in real-world settings. We introduce a benchmark suite of 205 tasks, spanning API usage (such as the terminal and Gmail) and GUI software (such as VS Code). See eval_online_benchmarks/README.md for details. The task-related files are available on our project page.

AgentStudio Datasets Decompose Agent Abilities

To gain deeper insights into agent capabilities beyond the overall performance measured by online benchmark tasks, we develop three datasets using AgentStudio: GroundUI, IDMBench, and CriticBench. These datasets target general UI grounding, learning from videos, and success detection. More details are provided in eval_agent_desiderata/README.md. All data are available at our project page.

AgentStudio Tools

To facilitate the development and evaluation of agents within the AgentStudio environment, we provide three tools: a GUI annotation tool, a video-action annotation tool, and a task creation and validation tool.

These tools, combined with the realistic environment of AgentStudio, help generate rich, structured data for training and evaluating agents. Please refer to docs/annotate_ground_ui.md for the GUI annotation tool, agent_studio/recorder/README.md for the video-action annotation tool, and eval_online_benchmarks/README.md for task creation and validation.

Contributing

Contributions and feedback from everyone on how to make this into a better tool are more than welcome. Please check out CONTRIBUTING.md for how to get involved.

Acknowledgement

We would like to thank the following projects for their inspiration and contributions to the open-source community: Open Interpreter, WebArena, Cradle, Synapse, SeeClick, ScreenAgent, OSWorld, etc.

Citation

If you find AgentStudio useful, please cite our paper:

@article{zheng2024agentstudio,
  title={AgentStudio: A Toolkit for Building General Virtual Agents},
  author={Longtao Zheng and Zhiyuan Huang and Zhenghai Xue and Xinrun Wang and Bo An and Shuicheng Yan},
  journal={arXiv preprint arXiv:2403.17918},
  year={2024}
}