WecoAI / aideml

AIDE: the Machine Learning CodeGen Agent
https://www.weco.ai
MIT License

AIDE is an LLM agent that generates solutions for machine learning tasks from natural language descriptions alone. On a benchmark of over 60 Kaggle data science competitions, AIDE outperformed 50% of Kaggle participants on average (see our technical report for details). AIDE has the following features:

  1. Instruct with Natural Language: Describe your problem or additional requirements and expert insights, all in natural language.
  2. Deliver Solution in Source Code: AIDE will generate Python scripts for the tested machine learning pipeline. Enjoy full transparency, reproducibility, and the freedom to further improve the source code!
  3. Iterative Optimization: AIDE iteratively runs, debugs, evaluates, and improves the ML code, all by itself.
  4. Visualization: We also provide tools to visualize the solution tree produced by AIDE for a better understanding of its experimentation process. This gives you insights not only about what works but also what doesn't.

How to use AIDE?

Setup

Make sure you have Python>=3.10 installed and run:

pip install -U aideml

Also install unzip to allow the agent to autonomously extract your data.

Set up your OpenAI (or Anthropic) API key:

export OPENAI_API_KEY=<your API key>
# or
export ANTHROPIC_API_KEY=<your API key>
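
Before launching a run, it can help to confirm that one of these keys is actually visible to the process. The following is a minimal sketch that relies only on the two environment variable names shown above; the helper name available_providers is hypothetical and not part of AIDE.

```python
import os

# Environment variable names taken from the setup instructions above.
KEY_VARS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def available_providers(env=None):
    """Return the providers whose API key is set to a non-empty value.

    `available_providers` is a hypothetical helper, not part of AIDE.
    """
    env = os.environ if env is None else env
    return [name for name, var in KEY_VARS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    found = available_providers()
    if found:
        print("API key found for:", ", ".join(found))
    else:
        print("No API key set; export OPENAI_API_KEY or ANTHROPIC_API_KEY first.")
```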

Running AIDE via the command line

To run AIDE:

aide data_dir="<path to your data directory>" goal="<describe the agent's goal for your task>" eval="<(optional) describe the evaluation metric the agent should use>"

For example, to run AIDE on the example house price prediction task:

aide data_dir="example_tasks/house_prices" goal="Predict the sales price for each house" eval="Use the RMSE metric between the logarithm of the predicted and observed values."

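Options: data_dir is the path to your data directory, goal describes the agent's goal for your task, and eval (optional) describes the evaluation metric the agent should use.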

Alternatively, you can provide the entire task description as a desc_str string, or write it in a plaintext file and pass its path as desc_file (example file).

aide data_dir="my_data_dir" desc_file="my_task_description.txt"

The result of the run will be stored in the logs directory.

The workspaces directory will contain all the files and data that the agent generated.
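
When runs are launched from a script, the key=value command line shown above can be assembled programmatically. The sketch below writes a task description to a plaintext file and builds the matching argument list. It uses only the data_dir and desc_file arguments that appear above; build_aide_command is a hypothetical helper, not part of AIDE.

```python
import shlex
from pathlib import Path


def build_aide_command(data_dir, desc_text, desc_path="my_task_description.txt"):
    """Write the task description to a file and return the argv list for `aide`.

    Hypothetical helper: only the `data_dir` and `desc_file` arguments from
    the examples above are used.
    """
    path = Path(desc_path)
    path.write_text(desc_text, encoding="utf-8")
    return ["aide", f"data_dir={data_dir}", f"desc_file={path}"]


if __name__ == "__main__":
    cmd = build_aide_command(
        "my_data_dir",
        "Predict the sales price for each house.\n"
        "Use the RMSE metric between the logarithm of the predicted and observed values.",
    )
    # Print a shell-safe version of the command.
    # To execute it instead, pass `cmd` to subprocess.run.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when the data directory or file path contains spaces.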

Advanced Usage

To further customize the behaviour of AIDE, check the config.yaml file for the available options.

Using AIDE in Python

Using AIDE within your Python script or project is straightforward. Follow the setup steps above, then create an AIDE experiment as shown below and run it:

import aide
exp = aide.Experiment(
    data_dir="example_tasks/bitcoin_price",  # replace this with your own directory
    goal="Build a timeseries forecasting model for bitcoin close price.",  # replace with your own goal description
    eval="RMSLE"  # replace with your own evaluation metric
)

best_solution = exp.run(steps=10)

print(f"Best solution has validation metric: {best_solution.valid_metric}")
print(f"Best solution code: {best_solution.code}")
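
Since the solution is returned as source code, one natural follow-up is to write it to disk so it can be inspected or rerun later. The sketch below assumes only the two attributes used above (code and valid_metric); save_solution and the output file names are hypothetical and not part of AIDE.

```python
from pathlib import Path


def save_solution(code, metric, out_dir="best_solution"):
    """Write solution source code and its validation metric to `out_dir`.

    Hypothetical helper, not part of AIDE. `code` and `metric` are meant to be
    `best_solution.code` and `best_solution.valid_metric` from the run above.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    script = out / "solution.py"
    script.write_text(code, encoding="utf-8")
    # str() keeps this independent of the metric's concrete type.
    (out / "metric.txt").write_text(str(metric), encoding="utf-8")
    return script


if __name__ == "__main__":
    # Stand-in values; in practice pass best_solution.code and
    # best_solution.valid_metric from the experiment above.
    path = save_solution("print('hello')\n", 0.123)
    print(f"Saved solution to {path}")
```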

Development

To install AIDE for development, clone this repository and install it locally.

git clone https://github.com/WecoAI/aideml.git
cd aideml
pip install -e .

A contribution guide will be available soon.

Algorithm Description

AIDE's problem-solving approach is inspired by how human data scientists tackle challenges. It starts by generating a set of initial solution drafts and then iteratively refines and improves them based on performance feedback. This process is driven by a technique we call Solution Space Tree Search.

At its core, Solution Space Tree Search consists of three main components:

By repeatedly applying these steps, AIDE navigates the vast space of possible solutions, progressively refining its approach until it converges on the best solution it can find for the given data science problem.
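
The draft-then-refine loop described above can be illustrated with a small, generic tree search. This is a sketch under stated assumptions, not AIDE's implementation: the Node class, the tree_search function, and the three callables (draft, improve, evaluate) are hypothetical names, and the selection rule here is a simple greedy choice of the best-scoring node, which may differ from how AIDE picks the solution to refine.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate solution in the search tree."""
    solution: float
    score: float
    parent: "Node | None" = None
    children: list = field(default_factory=list)


def tree_search(draft, improve, evaluate, num_drafts=3, steps=10):
    """Greedy tree search: draft several solutions, then keep refining the best.

    All three callables are assumptions of this sketch:
      draft()     -> a new initial solution
      improve(s)  -> a modified copy of solution s
      evaluate(s) -> a score where higher is better
    """
    nodes = []
    # Start with a set of independent initial drafts (the tree's roots).
    for _ in range(num_drafts):
        s = draft()
        nodes.append(Node(s, evaluate(s)))
    # Repeatedly pick the best node so far and attach an improved child to it.
    for _ in range(steps):
        base = max(nodes, key=lambda n: n.score)
        s = improve(base.solution)
        child = Node(s, evaluate(s), parent=base)
        base.children.append(child)
        nodes.append(child)
    return max(nodes, key=lambda n: n.score), nodes


if __name__ == "__main__":
    # Toy problem: find x close to 3. In an ML setting the solution would be
    # a script and the score its validation metric.
    rng = random.Random(0)
    best, nodes = tree_search(
        draft=lambda: rng.uniform(-10, 10),
        improve=lambda x: x + rng.uniform(-1, 1),
        evaluate=lambda x: -(x - 3.0) ** 2,
    )
    print(f"explored {len(nodes)} nodes, best solution {best.solution:.3f}")
```

Keeping every node, including the ones that scored worse, is what makes the result a tree rather than a single chain of edits: failed branches stay available for inspection alongside the successful ones.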

Tree Search Visualization