EmergenceAI / Agent-E

Agent driven automation starting with the web. Discord: https://discord.gg/wgNfmFuqJF
MIT License

Agent-E

Agent-E is an agent-based system that aims to automate actions on the user's computer. At the moment it focuses on automation within the browser. The system is based on the AutoGen agent framework.

This provides a natural language way of interacting with a web browser.

While Agent-E is growing, it is already equipped to handle a versatile range of tasks, but the best task is the one that you come up with. So, take it for a spin and tell us what you were able to do with it. For more information see our blog article.

Quick Start

Setup

pip issues

If you run into an issue where pip is not installed in the virtual env, you can take the following steps:

  1. Activate the venv
  2. Run python -m ensurepip --upgrade. This will install pip
  3. Deactivate the venv: deactivate
  4. Activate the venv again
  5. If you look in the .venv/bin dir you will now see pip3. At this point you still do not have a pip command, but you do have pip3
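A quick way to confirm the state of the environment after these steps is to ask the interpreter itself. The sketch below uses only the standard library; the function name pip_status is illustrative and is not part of Agent-E.

```python
import importlib.util
import sys


def pip_status() -> dict:
    """Report whether a venv is active and whether pip is importable in it."""
    return {
        # Inside a virtual environment sys.prefix differs from sys.base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        # find_spec returns None when the module cannot be imported.
        "pip_available": importlib.util.find_spec("pip") is not None,
        # The interpreter that is actually running, useful for spotting a wrong venv.
        "interpreter": sys.executable,
    }


if __name__ == "__main__":
    for key, value in pip_status().items():
        print(f"{key}: {value}")
```

If pip_available is reported as False while in_venv is True, the ensurepip step above has not taken effect for this interpreter.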

Blocking IO issues:

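If you are on a Mac and you are getting BlockingIOError: [Errno 35] write could not complete without blocking when AutoGen tries to print a large amount of text, run Python with the -u flag, which forces unbuffered output. For example: python -u -m ae.main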

User preferences

To personalize this agent, there is a need for Long Term Memory (LTM) that tracks user preferences over time. For the time being we provide a free-form user preferences text file that acts as a static LTM. You can see a sample here. Feel free to customize this file as you wish, making it more personal to you. This file might move to .gitignore in future changes.
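One way a static, free-form preferences file could reach the LLM is by appending its text to the system prompt. The sketch below illustrates that idea only; the function names, the file path passed in, and the prompt wording are assumptions and do not describe how Agent-E loads the file.

```python
from pathlib import Path


def load_user_preferences(path: str) -> str:
    """Return the free-form preferences text, or an empty string if the file is missing."""
    file = Path(path)
    if not file.is_file():
        return ""
    return file.read_text(encoding="utf-8").strip()


def build_system_prompt(base_prompt: str, preferences: str) -> str:
    """Append the preferences to the base prompt only when there are any."""
    if not preferences:
        return base_prompt
    return f"{base_prompt}\n\nKnown user preferences:\n{preferences}"
```

Because the file is free-form text, no parsing is needed: whatever the user writes is passed through unchanged.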

Run the code:

python -m ae.main (on a Mac, use python -u -m ae.main; see Blocking IO issues above)

Once the program is running, you should see an icon in the browser. The icon expands into a chat-like interface where you can enter natural language requests. For example: open youtube, search youtube for funny cat videos, find Nothing Phone 2 on Amazon and sort the results by best seller, etc.
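For scripts that start Agent-E on several platforms, the Mac-specific -u flag can be added conditionally. This is a small sketch under the assumption that the launcher runs in the same virtual environment; build_launch_command is a hypothetical helper, not part of the project.

```python
import sys


def build_launch_command(module: str = "ae.main", platform: str = sys.platform) -> list[str]:
    """Return the argv list for starting the given module with the current interpreter."""
    command = [sys.executable]
    if platform == "darwin":
        # -u forces stdout and stderr to be unbuffered, as recommended above for Macs.
        command.append("-u")
    command += ["-m", module]
    return command
```

The result can be passed to subprocess.run(build_launch_command(), check=True) from the project root.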

Demos

Each demo lists the video, the command given to Agent-E, and a description of what happens.

Oppenheimer Video
Command: There is an Oppenheimer video on youtube by Veritasium, can you find it and play it?
Description:
  • Navigates to www.youtube.com
  • Searches for Oppenheimer Veritasium using the searchbar
  • Plays the correct video

Example 2: Use information to fill forms
Command: Can you do this task? Wait for me to review before submitting.
Description: Takes the highlighted text from the email as part of the instruction.
  • Navigates to the form URL
  • Identifies elements in the form to fill
  • Fills the form using information from memory defined in user preferences.txt
  • Waits for user to review before submitting the form

Example 3: Find and add specific product to amazon cart
Command: Find Finish dishwasher detergent tablets on amazon, sort by best seller and add the first one to my cart
Description:
  • Navigates to www.amazon.com
  • Searches for Finish dishwasher detergent tablets using amazon search feature
  • Sorts the search results by best seller
  • Selects the first product to navigate to its product page
  • Adds the product to cart
(Note: sometimes the add-to-cart part does not execute, but a simple follow-up command, add first one to my shopping cart, works)

Example 4: Verify truthfulness of info using primary source
Command: Is this information about free courses true?
Description: Manually navigate to: https://twitter.com/aisolopreneur/status/1772686923045413123. Then give the command.
  • Navigates to NVIDIA homepage
  • Clicks on the developer link to navigate to developer page
  • Clicks on the free courses link to navigate to courses page
  • Validates availability of free courses and answers the user

Architecture

Agent-E system view

Building on the foundation provided by the AutoGen agent framework, Agent-E's architecture leverages the interplay between skills and agents. Each skill embodies an atomic action, a fundamental building block that, when executed, returns a natural language description of its outcome. This granularity allows Agent-E to flexibly assemble these skills to tackle complex web automation workflows.

Agent-E AutoGen setup

The diagram above shows the configuration chosen on top of AutoGen. The skills can be partitioned differently, but this is the one that we chose for the time being. We chose to use skills that map to what humans learn about the web browser rather than allow the LLM to write code as it pleases. We see the use of configured skills as safer and more predictable in its outcomes. Certainly the agent can click on the wrong things, but at least it is not going to execute malicious unknown code.

Agents

At the moment there are two agents: the User proxy (which executes the skills) and Browser navigation. The Browser navigation agent embodies all the skills for interacting with the web browser.

Skills Library

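At the core of Agent-E's capabilities is the Skills Library, a repository of well-defined actions that the agent can perform; for now, these are web actions. The skills are grouped into two main categories: Sensing Skills, which read the state of the browser or ask the user for input, and Action Skills, which act on the page.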

Each skill is created with the intention to be as conversational as possible, making the interactions with LLMs more intuitive and error-tolerant. For instance, rather than simply returning a boolean value, a skill might explain in natural language what happened during its execution, enabling the LLM to better understand the context and correct course if necessary.
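To make the contrast with a boolean return value concrete, here is a minimal sketch of a conversational skill. It is not Agent-E's implementation: the page argument is a hypothetical in-memory stand-in for a browser page (a dict mapping query selectors to element descriptions), and the message wording is invented for illustration.

```python
def click(page: dict, selector: str) -> str:
    """Click the element matching `selector` and describe the outcome in plain English.

    `page` is a hypothetical stand-in for a browser page: a dict that maps
    query selectors to dicts describing the element.
    """
    element = page.get(selector)
    if element is None:
        # Tell the LLM what it could have used instead, so it can correct course.
        known = ", ".join(sorted(page)) or "none"
        return (
            f"Could not find an element matching '{selector}'. "
            f"Selectors currently on the page: {known}."
        )
    if not element.get("enabled", True):
        return f"Found '{selector}' but it is disabled, so the click was not performed."
    element["clicked"] = True
    return f"Clicked the {element.get('tag', 'element')} matching '{selector}'."
```

A failure message that lists the available selectors gives the LLM something to act on in its next turn, which a bare False would not.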

Below are the skills we have implemented:

Sensing Skills
  • geturl - Fetches and returns the current URL.
  • get_dom_with_content_type - Retrieves the HTML DOM of the active page based on the specified content type. Content type can be:
    - text_only: Extracts the inner text of the HTML DOM. Responds with text output.
    - input_fields: Extracts the interactive elements in the DOM (button, input, textarea, etc.) and responds with a compact JSON object.
    - all_fields: Extracts all the fields in the DOM and responds with a compact JSON object.
  • get_user_input - Provides the orchestrator with a mechanism to receive user feedback to disambiguate or seek clarity on fulfilling their request.

Action Skills
  • click - Given a DOM query selector, clicks on the matching element.
  • enter_text_and_click - Optimized method that combines the enter text and click skills. The optimization helps use cases such as entering text in a field and pressing the search button. Since the DOM would not have changed, or changes should be immaterial to this action, the selectors for both the input field and the actionable button can be identified from the same DOM examination.
  • bulk_enter_text - Optimized method that wraps the enter_text method so that multiple text entries can be performed in one shot.
  • enter_text - Enters text in a field specified by the provided DOM query selector.
  • openurl - Opens the given URL in the current or a new tab.
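The relationship between enter_text and bulk_enter_text can be sketched as a thin wrapper. The signatures and the fields dict (a stand-in for the page's input fields, keyed by query selector) are assumptions made for this example, not the library's actual API.

```python
def enter_text(fields: dict, selector: str, text: str) -> str:
    """Enter `text` into the field matching `selector` and describe what happened."""
    if selector not in fields:
        return f"No field matches '{selector}'; nothing was entered."
    fields[selector] = text
    return f"Entered '{text}' into the field matching '{selector}'."


def bulk_enter_text(fields: dict, entries: list[tuple[str, str]]) -> list[str]:
    """Apply enter_text to each (selector, text) pair and collect the descriptions."""
    return [enter_text(fields, selector, text) for selector, text in entries]
```

Returning one description per entry lets the LLM see which entries succeeded without issuing a separate skill call for each field.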

DOM Distillation

Agent-E's approach to managing the vast landscape of HTML DOM is methodical and, frankly, essential for efficiency. We've introduced DOM Distillation to pare down the DOM to just the elements pertinent to the user's task.

In practice, this means taking the expansive DOM and delivering a more digestible JSON snapshot. This isn't just about reducing size; it's about homing in on relevance, serving the LLMs only what's necessary to fulfill a request. So far we have three content types: text_only, input_fields, and all_fields.

It's a surgical procedure, carefully removing extraneous information while preserving the structure and content needed for the agent’s operation. Of course with any distillation there could be casualties, but the idea is to refine this over time to limit/eliminate them.
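As a rough illustration of the idea, the sketch below distills raw HTML for the text_only and input_fields content types described in the Skills Library, using only Python's standard html.parser. It is a simplified stand-in rather than Agent-E's distillation code: the set of tags treated as interactive and the shape of the JSON output are assumptions, and all_fields is left out.

```python
import json
from html.parser import HTMLParser

# Assumed set of interactive tags; the real list may differ.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
# Content inside these tags is never user-visible text.
SKIPPED_TAGS = {"script", "style"}


class _Distiller(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.text_parts: list[str] = []
        self.fields: list[dict] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in INTERACTIVE_TAGS:
            # Keep only attributes that carry a value, to stay compact.
            kept = {name: value for name, value in attrs if value}
            self.fields.append({"tag": tag, **kept})

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def distill(html: str, content_type: str) -> str:
    """Return a reduced view of `html` for the requested content type."""
    parser = _Distiller()
    parser.feed(html)
    parser.close()
    if content_type == "text_only":
        return " ".join(parser.text_parts)
    if content_type == "input_fields":
        return json.dumps(parser.fields, separators=(",", ":"))
    raise ValueError(f"Unsupported content type: {content_type}")
```

Even on a small page, the input_fields view drops layout markup, scripts, and styles, which is where most of the size reduction comes from.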

Since we can't rely on all web page authors to use best practices, such as adding unique ids to each HTML element, we had to inject our own attribute (mmid) in every DOM element. We can then guide the LLM to rely on using mmid in the generated DOM queries.
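The injection step can be pictured with a small offline sketch. Agent-E does this inside the live browser DOM; the version below instead uses Python's xml.etree.ElementTree on a well-formed markup string, and the sequential numbering scheme and the selector_for helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def inject_mmid(markup: str) -> str:
    """Add a sequential mmid attribute to every element of well-formed markup."""
    root = ET.fromstring(markup)
    # iter() walks the tree in document order, starting with the root itself.
    for index, element in enumerate(root.iter(), start=1):
        element.set("mmid", str(index))
    return ET.tostring(root, encoding="unicode")


def selector_for(mmid: int) -> str:
    """Build the attribute selector an LLM could use to target one element."""
    return f"[mmid='{mmid}']"
```

With every element carrying its own mmid, a generated query such as selector_for(2) is unambiguous even when the page author supplied no ids at all.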

To cut down on some of the DOM noise, we use the DOM Accessibility Tree rather than the regular HTML DOM. The accessibility tree by nature is geared towards helping screen readers, which is closer to the mission of web automation than plain old HTML DOM.

The distillation process is a work in progress. We look to refine this process and condense the DOM further, aiming to make interactions faster, more cost-effective, and more accurate.

Testing and benchmarking

We build on the work done by WebArena for testing and evaluation. The test directory contains a tasks subdirectory with a JSON file, which contains test cases that also act as examples. Not all of them will pass. While WebArena creates a set of static and controlled sites, we opted for using the wild web to bring the experience closer to what we all experience on a daily basis. This comes with pluses and minuses of course.

Note: WebArena uses OpenAI for some test validation strategies, so OPENAI_API_KEY must be set in the .env file

Run examples/tests:

python -m test.run_tests (on a Mac, use python -u -m test.run_tests)

This will take time to run. Alternatively, to run only particular examples, set the min and max task indices.

Parameters for run_tests:

- `--min_task_index`: Minimum task index to start tests from (default: 0)
- `--max_task_index`: Maximum task index to end tests with, non-inclusive
- `--test_results_id`: A unique identifier for the test results. If not provided, a timestamp is used
- `--test_config_file`: Path to the test configuration file. Default is "test/tasks/test.json" in the project root.
- `--wait_time_non_headless`: The amount of time to wait between non-headless tests
- `--take_screenshots`: Takes screenshots after every operation performed. Example: `--take_screenshots true`. Defaults to `false`

For example: python -m test.run_tests --min_task_index 0 --max_task_index 28 --test_results_id first_28_tests (add -u for Mac)
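The documented flags map naturally onto an argparse parser. The sketch below mirrors them to show how the non-inclusive upper bound and the documented defaults behave; it is not the project's run_tests code. The helper names, the boolean parsing rule, and the numeric type of --wait_time_non_headless are assumptions, and its default is left unset because none is documented.

```python
import argparse
from datetime import datetime


def str_to_bool(value: str) -> bool:
    """Interpret values such as 'true' or 'false' passed on the command line."""
    return value.strip().lower() in {"true", "1", "yes"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run test tasks (illustrative sketch)")
    parser.add_argument("--min_task_index", type=int, default=0)
    parser.add_argument("--max_task_index", type=int, default=None)
    parser.add_argument("--test_results_id", default=None)
    parser.add_argument("--test_config_file", default="test/tasks/test.json")
    # Type is assumed and no default is documented, so it stays unset here.
    parser.add_argument("--wait_time_non_headless", type=float, default=None)
    parser.add_argument("--take_screenshots", type=str_to_bool, default=False)
    return parser


def resolve_results_id(value) -> str:
    """Fall back to a timestamp when no results id was given."""
    return value or datetime.now().strftime("%Y%m%d_%H%M%S")


def select_tasks(tasks: list, min_index: int, max_index) -> list:
    """Slice the task list; like the documented flag, max_index is non-inclusive."""
    return tasks[min_index:max_index]
```

With --min_task_index 0 --max_task_index 28, select_tasks picks tasks 0 through 27, which matches the first_28_tests naming in the example command.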

Docs generation:

Ensure that dev dependencies are installed before doing this.

  1. Go to project root
  2. mkdir docs
  3. cd docs
  4. sphinx-quickstart
  5. Modify/add to docs/conf.py the following:
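    import os
    import sys

    # Make the project root importable so autodoc can find the modules.
    sys.path.insert(0, os.path.abspath('..'))
    # autodoc pulls documentation from docstrings; napoleon parses Google/NumPy-style docstrings.
    extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon']
    # Requires the sphinx_rtd_theme package to be installed.
    html_theme = 'sphinx_rtd_theme'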
  6. Use api docs style for the generation, from project root run: sphinx-apidoc -o docs/source .
  7. Build the documentation, from docs directory, run: sphinx-build -b html . _build

Open-source models

Using open-source models is possible through LiteLLM with Ollama. Ollama allows users to run language models locally on their machines, and LiteLLM translates OpenAI-format inputs to local models' endpoints. To use open-source models as Agent-E backbone, follow the steps below:

  1. Install LiteLLM
    pip install 'litellm[proxy]'
  2. Install Ollama
    • For Mac and Windows, download Ollama.
    • For Linux:
      curl -fsSL https://ollama.com/install.sh | sh
  3. Pull Ollama models. Before you can use a model, you need to download it from the library. The list of available models is here. Here, we use Mistral v0.3:
    ollama pull mistral:v0.3
  4. Run LiteLLM. To run the downloaded model with LiteLLM as a proxy, run:
    litellm --model ollama_chat/mistral:v0.3
  5. Configure model in AutoGen. Configure the .env file as follows. Note that the model name and API keys are not needed since the local model is already running.
    AUTOGEN_MODEL_NAME=NotRequired
    AUTOGEN_MODEL_API_KEY=NotRequired
    AUTOGEN_MODEL_BASE_URL=http://0.0.0.0:4000
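AutoGen describes each model endpoint as a dict with model, api_key, and base_url keys, collected in a config list. The sketch below shows how the three .env values above could be turned into such an entry; how Agent-E itself reads these variables is not shown in this README, so build_config_list is a hypothetical helper.

```python
from collections.abc import Mapping


def build_config_list(env: Mapping) -> list[dict]:
    """Translate the AUTOGEN_MODEL_* settings into an AutoGen-style config list."""
    return [
        {
            # Placeholders are fine here: the local proxy already knows its model.
            "model": env.get("AUTOGEN_MODEL_NAME", "NotRequired"),
            "api_key": env.get("AUTOGEN_MODEL_API_KEY", "NotRequired"),
            # The base URL is the only value that must point somewhere real.
            "base_url": env["AUTOGEN_MODEL_BASE_URL"],
        }
    ]
```

Calling build_config_list(os.environ) after the .env file has been loaded raises a KeyError if AUTOGEN_MODEL_BASE_URL is missing, which surfaces a misconfigured proxy address early.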

TODO

Social:

Discord

Contributing

Thank you for your interest in contributing! We welcome involvement from the community.

Please visit our contributing guidelines for more details on how to get involved.

Citation

If you use this work, please cite our blog:

@misc{emergence2024distilling,
  title={Distilling the web for multi-agent automation},
  author={Emergence},
  howpublished={\url{https://blog.emergence.ai/2024/03/28/distilling-the-web-agent.html}},
  journal={Emergence Journal Blog},
  year={2024},
  month={Mar}
}