e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.
MIT License

Augmentoolkit — infinite domain-specific instruct data

Your custom LLMs need custom data. Augmentoolkit creates quality data quickly, cheaply, and painlessly.

Now you can turn any raw text into a high-quality custom dataset for training new LLMs (or classifiers), using open-source AI. Make data gathering a painless step of the model creation process. Augmentoolkit is the easy-to-use, customizable, open-source, and cost-effective data generation solution. No OpenAI needed.

Because it is extensible, new pipelines can be added to Augmentoolkit incredibly easily, and there are already three of them: the original QA generation pipeline, the classifier creator, and a pipeline for generating creative writing data based on input fictional stories.

Augmentoolkit is an AI-powered tool that lets you create domain-specific data, using open-source AI.

If you like the project, please consider starring it!


RECENT FEATURES UPDATE — SEPTEMBER 12th 2024

In addition to a complete refactor that makes adding and using many different pipelines easy, Augmentoolkit can now make high-quality RP data based on the themes and narratives of any story imaginable. Basically:

  1. LLM extracts the primary theme and various genre tags from a chunk of a story
  2. LLM generates a character card and plan for the overall story
  3. LLM uses a truly massive prompt — 22 thousand tokens long — to make a very long-context story
  4. Story is rated according to a set of criteria for non-repetitiveness and writing quality.
  5. Story is saved.
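The five stages above can be sketched as a linear flow. The sketch below is purely illustrative; the function, prompts, and rating scale are stand-ins, not Augmentoolkit's real code:

```python
def run_rp_pipeline(chunk, llm, min_score=3):
    """Illustrative sketch of RPToolkit's staged flow; `llm` is any
    callable that takes a prompt string and returns generated text."""
    themes = llm(f"Extract the primary theme and genre tags:\n{chunk}")            # step 1
    card = llm(f"Write a character card and overall story plan for: {themes}")     # step 2
    story = llm(f"<massive 22k-token storygen prompt would go here>\nPlan: {card}")  # step 3
    # step 4: rate for writing quality and non-repetitiveness (scale is a stand-in)
    score = int(llm(f"Rate this story 1-5 on quality and non-repetitiveness:\n{story}"))
    return story if score >= min_score else None                                   # step 5: save or drop
```

The key point is that each stage feeds the next, and only stories that pass the rating step are kept.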

I used this pipeline to make a medium-sized RP dataset to demonstrate the process. It's got about 1,000 stories and 1,169,884 trainable tokens; you can check it out here!

So all you need to get quality RP data is now some stories you like and a button press. Finally you can make AI inspired by the same literature, games, or other fictional media you love — for instance, feed in Lord of the Rings, you get out high fantasy RP sessions. That is the intended utility of this new pipeline.

This pipeline can get a bit pricey if you use an API, so I recommend generating locally or renting compute on a service like Runpod. The really expensive step is story generation; it might make sense to take a hybrid approach and use an API for all non-storygen steps, but a powerful local model on rented compute for story generation. This allows for a good balance of speed and cost.

To get started, point super_config.yaml at any of the RPToolkit preset configs. You can find detailed instructions and guidance in the RPToolkit section of this README.

OK, back to your regularly-scheduled README.


Cite: DOI

Benefits

Augmentoolkit makes LLM data easy.

We've also done our best to facilitate the step after you generate your data -- training your LLM:

Finally, using the model you create should be easy and valuable:

Clarification: Augmentoolkit, the project, has multiple pipelines: the original pipeline (QA), RPToolkit (rich multi-turn roleplaying data), and the classifier creator. When this README says "Augmentoolkit can make [some kind of data]," it means that one of Augmentoolkit's pipelines can do so.

Demo video & Video Tutorials (EXTENSIVE LIBRARY):

3-Minute Demo Video Here

Quickstart Guide

Project Overview (for Intuition and understanding)

Local Dataset Generation Tutorial

Renting Compute For Datagen (Aphrodite engine)

Training a Model on Augmentoolkit Data IMPORTANT NOTE: if you're creating your Runpod account for the first time in the above video, I would appreciate it if you used this Runpod referral link https://runpod.io?ref=tjhovswf to support Augmentoolkit's creation and open-sourcing of additional datasets.

Augmentoolkit Original Introduction/Hype Video

RPToolkit Introduction/Hype Video

Classifier Creator Demo (set to a Chopin piece no less)

Table of Contents:

  1. Quickstart
  2. Vision (Introduction)
  3. Usage
  4. Each Pipeline In-Depth
  5. Customization
  6. Training a model
  7. Roadmap
  8. Contributing
  9. Community
  10. Sponsorship and Donation
  11. Self Promotion (Read if you're a Business!)
  12. Think this is cool? Connect with me elsewhere!

Quickstart

The quickstart instructions are for the QA pipeline. The process for using other pipelines, or other config files within the QA pipeline, is much the same; just change the folder path and config path in super_config.yaml accordingly.

Terminal

After installing the dependencies:

There's also a quickstart video that you can follow along with! Note that the default provider has been changed to DeepInfra; you'll need to get a key from them, or change the base URL to another provider such as Together.

If you want to use PDFs, you will have to install Tesseract, which has its own installation instructions: https://github.com/tesseract-ocr/tesseract. However, the project works fine without it if you just want to use .txt files.

Web UI

  1. Install the dependencies (pip install -r requirements.txt)
  2. Run python streamlit_app.py
  3. In the browser tab that this command opens, add your API key for whatever cloud AI provider you like the most, or a local AI server. Change the base URL as appropriate, too.
  4. Save your changes.
  5. Hit the run pipeline button at the bottom of the panel.

webui.jpg

Vision

Dataset creation has long been the most painful, and most important, step of the finetune-creation process. Most people have to resort to either A) burning an obscene number of OpenAI API credits after spending a bunch of time hacking together a script for their needs, or B) spending hundreds, if not thousands, of hours accumulating a hybrid dataset based on their own conversations with bots. The OpenAI approach relies on a paid service (whose TOS you're violating) that can ban you at any second, whose writing style you probably hate, which is getting worse every month, and whose synthetic data critically lacks variety. Handwriting the examples is far too slow to iterate on, and does not scale at all, meaning you miss out on the huge potential performance increases that come with more data. If you're a company and you pay people to create examples in bulk, it's possibly pricier than even OpenAI, and also not scalable at all. And moreover: if we're literally creating machines that can write, why do we spend most of our time writing?

Augmentoolkit is meant to make high-quality data generation easy, fast, shareable, configurable, and for everyone. Some of the greatest joy in LLM creation is making an AI for an area you're passionate about; whether this passion is for fiction or a factual domain, Augmentoolkit lets you create the custom data you need to make your dream AI model real.

Having been rebuilt from the ground up to be extensible and configurable, Augmentoolkit is now the best home for any open data generation pipeline. Adding a new pipeline is as simple as copying a folder. Pipelines' prompts can be switched out in a completely modular manner. Settings are simple to change, too. Finally, a minimalistic but useful set of abstractions makes building resumable data generation pipelines easy as pie. Augmentoolkit is more than just a pipeline; it's more than just three pipelines, even! It's THE place for model creators to build their datasets, whether they're professionals or hobbyists. And it's an evolving open-source project with more added every month.

Augmentoolkit allows any enthusiast, regardless of computer strength, to contribute to the advancement of AI by generating swathes of data for cheap or by designing and contributing a pipeline for a new and important task. The Augmentoolkit project strives to expand the possibilities of what finetunes can be built, by making data gathering as easy as running a script. Whether you're finetuning a company chatbot to understand your business's information, are creating an AI ambassador for your community that can explain your mission and goals, or are doing something else entirely, Augmentoolkit exists to make your data problems a bit less problematic.

We're going to make dataset creation the most enjoyable, powerful, and flexible part of creating a new LLM.

Right now you can:

Whether you want to train an LLM on your company's knowledge base, create a roleplayer specializing in your favorite genre, or create an AI expert on 18th century military strategy, Augmentoolkit removes 'not enough data' as an obstacle.

I can't wait to see what you'll build.

Usage

Relevant video

Assuming that you have already installed everything via the quickstart, an overview of the important parts of the project can be found here. Otherwise, follow the instructions below to install and get an understanding of the overall shape of the project.

Installation

First, get the repository onto your computer:

git clone https://github.com/e-p-armstrong/augmentoolkit.git

Then, install the project's dependencies.

pip install -r requirements.txt

You may get some messages saying that torchvision and torchaudio require older versions of PyTorch. These messages can be safely ignored.

NOTE: for large-scale dataset generation, it is likely more cost-effective to rent GPUs for a couple of bucks an hour on a service like Vast.ai or Runpod than to use APIs like Together.ai. However, APIs are faster and require little setup. So the currently advised process is: experiment with APIs, then generate for production with rented compute.

There are two video guides on local dataset generation with Augmentoolkit, one for running it on your actual computer, and another for renting computers with powerful GPUs and using those to cost effectively generate data.

A note for when you start using Augmentoolkit multiple times: all of Augmentoolkit's pipelines, to some extent, resume previously-started runs if the output folder is not empty. Rename that folder or move it elsewhere if you are not trying to continue an interrupted dataset generation run, or change the output folder path in the config you're using.

Basics of running Augmentoolkit

The main script of the project is run_augmentoolkit.py. This script uses super_config.yaml to decide which pipelines to execute, in what order, and with which settings (config files). A pipeline is a folder that contains the following files: a processing.py, a steps.py, an __init__.py, and at least one .yaml file with "config" in its name. Details of what settings should exist in each pipeline's config.yaml can be found in the section of this README devoted to that pipeline.

To change the settings (like the API provider, chunk size, whether to skip certain steps, or which prompt preset to use) of an individual pipeline, you change its config file (or add a new one) in its folder. To change which pipeline runs when you execute run_augmentoolkit.py, you change super_config.yaml.

Super Config

One config to rule them all

The file super_config.yaml lets you choose which pipelines to run. It's a very simple and minimalistic file. Its contents might look like this, for instance:

pipeline_order:
  - folder: "classifier_creator"
    config: "config.yaml"
  - folder: "original"
    config: "config_overrides/groq/groq-normal.yaml"
  - folder: "original"
    config: "config_overrides/groq/groq-negative.yaml"

Each folder field is a relative path (relative to the root folder of the project) to a valid pipeline folder (contains a processing.py and a steps.py etc. at top level). Each config field is a relative path (relative to the pipeline folder specified in folder) that points at a .yaml file that contains settings for that given pipeline. This setup means that one project can have many different config files, and the pipeline operator can switch between them as needed depending on the situation and requirements. This is a benefit for organization.

Pipelines are executed in the order they appear in the pipeline_order from top to bottom.
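Conceptually, a runner only has to read the list and resolve the two relative paths for each entry. The sketch below is a rough illustration (assuming PyYAML), not run_augmentoolkit.py's actual loader:

```python
import os
import yaml

def load_pipeline_order(path="super_config.yaml"):
    """Illustrative: read super_config.yaml and return (folder, config)
    pairs in execution order. `folder` is relative to the project root;
    `config` is relative to that pipeline folder."""
    with open(path) as f:
        super_config = yaml.safe_load(f)
    runs = []
    for entry in super_config["pipeline_order"]:
        config_path = os.path.join(entry["folder"], entry["config"])
        runs.append((entry["folder"], config_path))
    return runs  # execute these top to bottom
```

This is why one pipeline folder can hold many config files: each super_config entry just points at a different one.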

Each Pipeline In-Depth

QA Generation

QA Overview

The first pipeline to ever be added to Augmentoolkit, QA generation is focused on creating instruct tuning data for specific facts. This can give an LLM a broad understanding of the facts behind a subject. Especially when combined with RAG, this can produce a bot that is decent at answering factual questions on a specific domain — in other words, this is great for creating domain experts.

The QA pipeline also comes bundled with three prompt override suites by default. Open-ended prompts (original/prompt_overrides/prompts_override_open-ended_questions) create long and detailed single questions, while negative prompts (original/prompt_overrides/prompts_override_negative_questions) help defend against hallucination.

QA Config, Step-by-Step

You can easily customize Augmentoolkit's original pipeline by changing the settings in config.yaml or one of the other configs in that pipeline. Augmentoolkit's QA pipeline, specifically, has a wide variety of prebuilt configs for a number of different API providers and local AI servers (Ollama, llama.cpp, Aphrodite Engine, etc.). Let's walk through each field in the YAML file so that you can understand how to change it to suit your needs:

First up, we have the API section:

API:
  LARGE_API_KEY: key-here
  LARGE_MODEL: meta-llama/Meta-Llama-3.1-70B-Instruct
  LARGE_BASE_URL: https://api.deepinfra.com/v1/openai
  LARGE_MODE: api
  SMALL_MODEL: meta-llama/Meta-Llama-3.1-8B-Instruct
  SMALL_BASE_URL: https://api.deepinfra.com/v1/openai
  SMALL_API_KEY: key-here
  SMALL_MODE: api

Field-by-field:

Following this, we have the HUGGINGFACE section:

HUGGINGFACE:
  HUB_PATH: yourusername/your-path-here
  PRIVATE: False
  PUSH_TO_HUB: False

This section lets you automatically push your generated dataset to the HuggingFace Hub once it is finished generating. There is a bit of configuration:

Next up, we have the PATH section:

PATH:
  INPUT: "./raw_text_input_vision_paper"
  OUTPUT: "./output"
  DEFAULT_PROMPTS: "./prompts"
  PROMPTS: ./prompts_vision_paper

Field-by-field:

PHASE is left to the end of this step-by-step since it's a bit nuanced.

Briefly, we have the SKIP section:

SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: False
  QUESTION_CHECK: False
  CONVERSATION_GENERATION: False
  REPAIR_QA_TUPLES: True

Very simply, this section lets you skip certain parts of the QA pipeline. All of these are currently validation steps: they will just act as if everything came out as True (passed). This is useful for certain types of data — for instance, if the filter_chunks step keeps deciding that much of your data is "not suitable for questions" even if it is just unconventional, then you can solve this problem by skipping the step. This is a tradeoff, however: skipping these steps can lead to lower-quality data, especially under normal circumstances.

IMPORTANT: If you want to use the "negative" prompt overrides, you have to turn the answer relevancy check skip on (set ANSWER_RELEVANCY_CHECK to True in the SKIP section)!

Next, we have the SYSTEM section:

SYSTEM:
  CHUNK_SIZE: 1900
  USE_FILENAMES: False
  COMPLETION_MODE: false
  CONCURRENCY_LIMIT: 60
  DOUBLE_CHECK_COUNTER: 1
  DO_NOT_USE_SYSTEM_PROMPTS: True
  FINAL_ASSISTANT_PROMPTS_NO_RAG: [
  'You are a helpful AI assistant.',
  'You are A VASTLY intelligent ARTIFICIAL INTELLIGENCE with DOMAIN-EXPERT KNOWLEDGE from a variety of fields.

  USE your knowledge to be helpful and truthfully answer questions about the world.',
  "u are ai asstant plz answr questions"] # a wide variety of system prompts helps the AI learn better. What, you expect your users to spell things right?
  FINAL_ASSISTANT_PROMPTS_RAG: [
  'You are a helpful AI assistant. Some knowledge:

  {data}',

  '{data}

  You are an AI domain expert. Answer questions',
  'You are an AI with vast knowledge. Here is some potentially-relevant context:

  {data}

  Answer questions according to your knowledge.']
  MODE: api
  STOP: true
  SUBSET_SIZE: 10
  USE_SUBSET: true

Field-by-field:

Finally, PHASE:

One constraint of local generation is that you can only run one model at once. Augmentoolkit typically uses two different models: a small one for bulk work, and a large smart one for tough tasks. To still use small, efficient models for bulk work and large ones for the difficult steps, we have to run a pipeline with one model, stop at the point where the model we're using changes, run it again with a different model, and so on until the whole thing is done. PHASE exists to make this process easier.

The process is: turn WORK_IN_PHASES to True, and set PHASE_INDEX according to how far along your dataset generation run you are. For QA generation, phase index 0 = filtering out chunks with no relevant context, and uses small models; index 1 = question generation, uses large models; index 2 = question validation, answer relevancy validation, and answer accuracy validation, uses small models; index 3 = context revision and conversation generation, the final phase, uses large models.

Start up your local openai-compatible LLM server, with a smaller model. Set the config to this:

PHASE:
  WORK_IN_PHASES: True
  PHASE_INDEX: 0

Get all your other settings in place (input texts, base_url, etc.), and run run_augmentoolkit.py. When that finishes, change the config to:

PHASE:
  WORK_IN_PHASES: True
  PHASE_INDEX: 1

and restart your local LLM server to use a larger and more powerful LLM. Then run run_augmentoolkit.py again — it will pick up where you left off, thanks to Augmentoolkit's auto-resume feature. When that step completes, set the config to

PHASE:
  WORK_IN_PHASES: True
  PHASE_INDEX: 2

and have your local LLM server use a small model. Finally, once that is done, go ahead and run phase 3 with a large model:

PHASE:
  WORK_IN_PHASES: True
  PHASE_INDEX: 3

This process replaces the more cumbersome approach of having two separate files for local inference; now you manage it all from the config. If you want to "set it and forget it" with your datagen run, you can simply eat the longer generation time of using a more powerful model for everything; it won't hurt you unless you're using rented compute, in which case the slower speeds mean more hours of renting, and more cost, which might hurt a bit.

To speed up generation and improve cost efficiency, it may be best to rent compute from Runpod.io or a similar GPU rental service (I recommend either 2x H100s or 8x A40s). For large-scale dataset generation tasks this will likely be cheaper than using an API, and it doesn't suffer from the painful generation speeds that consumer hardware can sometimes face.

If WORK_IN_PHASES is off, the whole pipeline will execute when you run the script.
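Between phase runs, the only config change needed is the PHASE_INDEX value. A tiny hypothetical helper (not part of Augmentoolkit) can automate that edit if you script the process yourself:

```python
import re

def set_phase_index(config_text: str, phase: int) -> str:
    """Hypothetical helper: rewrite the PHASE_INDEX value in the raw
    text of a config file between phase runs."""
    return re.sub(r"PHASE_INDEX:\s*\d+", f"PHASE_INDEX: {phase}", config_text)
```

You would still restart your local LLM server with the appropriately-sized model before each run, as described above.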

Happy dataset generation! Enjoy making awesome domain experts, now that data is finally an easy part of the process.

QA Visual Explanation of Steps

Here is a flowchart detailing how a typical run of Augmentoolkit's QA pipeline may proceed. The source text can be anything with information you can ask questions about.

QA What to do with the outputs

The important files to look for in your OUTPUT folder are simplified_data_no_rag.jsonl, simplified_data_rag.jsonl, and pretraining.json. These are what you will most commonly use for training. The other top-level files are there in case you want more information, such as the chunk and the name of the file that each conversation was generated from. For training, though, you will want simplified_data_no_rag.jsonl, simplified_data_rag.jsonl, and pretraining.json. All are already formatted for use with the Axolotl open-source training library; all you need to do is use these datasets the way the provided configs in _model_training_configs/ use them.

The format of the conversational files is called "ShareGPT", and is a common format across many datasets. pretraining.json however is formatted as pretraining data. To bake factual information into an LLM, it is recommended you use a full finetune or (cheaper) GaLore tuning, combined with continued pretraining on the source text + the instruct data that Augmentoolkit generates. If you want a more in-depth example, check out the provided configs, or the second video of the Video Documentation.
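For reference, ShareGPT-style records are commonly structured like the `record` below (these field names are the usual convention for the format; verify against your actual output files). A quick way to sanity-check a .jsonl output is to count turns:

```python
import json

# A ShareGPT-style record, as the format is commonly structured
# (illustrative content, not pulled from a real Augmentoolkit output):
record = {
    "conversations": [
        {"from": "system", "value": "You are a helpful AI assistant."},
        {"from": "human", "value": "When was the treaty signed?"},
        {"from": "gpt", "value": "It was signed in 1905."},
    ]
}

def count_assistant_turns(path):
    """Count assistant ("gpt") turns across a .jsonl file of such records."""
    total = 0
    with open(path) as f:
        for line in f:
            convo = json.loads(line)
            total += sum(1 for m in convo["conversations"] if m["from"] == "gpt")
    return total
```

Skimming a few records this way is a cheap check before committing to a training run.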

In a recent update, Augmentoolkit gained the ability to produce data from the generation of questions, the filtering of input chunks, and conversation generation as well. These outputs can be identified as .jsonl files with _DATAGEN_OUTPUT in their filenames. You'll understand exactly what they are when you look at one.

They're in ShareGPT format for easy training, and can be used to bulk up a training run by acting as yet more diverse data on the given subject. They can also be used to make LLMs that are experts in running as part of Augmentoolkit specifically — train a model on enough of these, and you will get a powerful tool for local inference.

QA Quirks and Tips

RPToolkit

RPToolkit, as a pipeline, is contained within the larger Augmentoolkit project (which has a few other pipelines for other uses). Click here to go to the top of the README. Click here to see the table of contents.

RPToolkit Overview and Quickstart

RPToolkit is the answer to people who have always wanted to train AI models on their favorite genre or stories. This pipeline creates varied, rich, detailed, multi-turn roleplaying data based on the themes, genre, and emotional content of input stories. You can configure the kind of data you generate through the settings or, better still, by changing the input data you supply to the pipeline.

The writing quality and length of the final data in this pipeline is enhanced through a painstakingly-crafted 22-thousand-token prompt.

Here's how to run this pipeline (a quickstart):

pip install -r requirements.txt

Change super_config.yaml to be:

pipeline_order:
  - folder: "rptoolkit"
    config: "config.yaml"

Add your API key for fireworks.ai to rptoolkit/config.yaml. If you want to use a different provider, change the BASE_URL to that provider's OpenAI-compatible API.

Then run python run_augmentoolkit.py.

RPToolkit Config Step-by-Step

First up, we have the API section. RPToolkit's API section is basically the same as the QA pipeline's, except that it allows finer control.

API:
  API_KEY_A: key
  API_KEY_B: key2
  BASE_URL_A: https://api.together.xyz
  BASE_URL_B: https://api.fireworks.ai/inference/v1
  LOGICAL_MODEL_A: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
  LOGICAL_MODEL_B: accounts/fireworks/models/llama-v3p1-405b-instruct

Field-by-field:

Next up, we have the PATH field. This is exactly the same as that of the QA pipeline.

PATH:
  DEFAULT_PROMPTS: ./prompts
  INPUT: ./raw_txt_input
  OUTPUT: ./output
  PROMPTS: ./prompts

Field-by-field:

Following this, we have RPToolkit's PHASE section. This is also very similar to that of the QA pipeline.

PHASE:
  WORK_IN_PHASES: False
  PHASE_INDEX: 0

Finally, we have SYSTEM:

SYSTEM:
  COMPLETION_MODE: False
  CONCURRENCY_LIMIT: 3
  CHUNK_SIZE: 1500
  EMOTIONS: ['DOMINANCE', 'FEARLESSNESS', 'EMBARASSMENT', 'NIHILISM',
    'DETERMINATION', 'DESPERATION', 'LOSS', 'NOSTALGIA', 'ANTICIPATION',
    'TRUST', 'FEAR', 'DISORIENTATION', 'DEGRADATION']
  INCLUDE_CHUNK_IN_PROMPT: True
  MODE_A: api
  MODE_B: api
  PICK_EMOTION: True
  RP_PROMPT_END: ''
  RP_PROMPT_START: ''
  STOP: True
  SUBSET_SIZE: 3
  USE_MIN_P: False
  USE_SUBSET: True

Many of these settings are repeated from the QA pipeline, some are not. All will be covered here.

Field-by-field:

RP_PROMPT_END and RP_PROMPT_START are for customizing the system prompts of the data that is produced at the end. The system prompts are formatted in this way in the code:

rp_prompt_start + data_obj["scene_card"] + rp_prompt_end

So, RP_PROMPT_START is a string prepended to the start of a scene card, and RP_PROMPT_END is appended to the end, making up a "character card" in the training data. One of the great faults of RPToolkit is that its system prompts need to be far more varied, especially in their formats; that variety is not implemented yet. In the meantime, you have control over the preambles and endings of the system prompts used when the data is saved, after everything is generated. You should probably leave these blank unless you have a specific reason to do otherwise, as the defaults are mostly sensible. Also, consider writing up a quick script to shuffle the order of information in the system prompts before training. I would accept such a contribution to the repo, in fact.
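Such a shuffle script could be quite small. A hypothetical sketch (the function name and the assumption that scene-card fields are newline-separated are mine, not the repo's):

```python
import random

def shuffle_scene_card_fields(scene_card: str, seed=None) -> str:
    """Hypothetical pre-training step: randomize the order of
    newline-separated fields in a scene card so the trained model
    doesn't overfit to one rigid system-prompt layout."""
    rng = random.Random(seed)
    fields = [line for line in scene_card.split("\n") if line.strip()]
    rng.shuffle(fields)
    return "\n".join(fields)
```

You would run something like this over the system prompts in the saved data before training.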

Moving onto the other fields:

RPToolkit Visual Explanation of Steps

RPToolkit What To Do With Outputs

RPToolkit outputs its final, complete RP sessions to the final_outputs folder, inside the output folder. The files are mostly in ShareGPT format for easy training, much like the QA pipeline's outputs.

- full_stories_list_complete_format.json: contains every generation and every bit of information created for each chunk, from the beginning of the pipeline onward, including intermediate steps. Think of it as a lossless extended format that lets you use this pipeline for usecases other than training, if you have them. This file has absolutely every story, regardless of rating.
- full_stories_list_sharegpt.json: contains every single story generated by RPToolkit in your generation run, regardless of rating. This means that everything from the lowest quality story to the highest quality story is there.
- good_and_above_stories_list_complete_format: the same as full_stories_list_complete_format.json, but filtered to only include stories with all categories rated "good" or above by the rating AI.
- good_and_above_stories_list_sharegpt: the same as full_stories_list_sharegpt.json, but filtered to only include stories with all categories rated "good" or above by the rating AI.
- incredible_stories_list_complete_format: the same as full_stories_list_complete_format.json, but filtered to only include stories with all categories rated "incredible" by the rating AI.
- incredible_stories_list_sharegpt: the same as full_stories_list_sharegpt.json, but filtered to only include stories with all categories rated "incredible" by the rating AI.

As for intermediate outputs: all intermediate outputs live in a folder named for the step (emotion_generation, feature_extraction, etc.). There are two subfolders in each of these folders: one containing .yaml files to be used for debugging or seeing what the AI has done, and one containing .json files meant to be read by the pipeline in the event it is continuing a previous run.

RPToolkit Quirks and Tips


Classifier Creator

Classifier Overview and Quickstart

The classifier creator lets you train a whole classification model in minutes. Generation can be done locally or via an API, while model training is done locally on the CPU (classifiers are just that easy to train!)

When do you want a classifier? Maybe you want to go through a dataset and classify data as "high-quality" or "low-quality" and train on only the high-quality stuff. Or, maybe you want to make some custom moderation for an application. Or, maybe you want to hunt through a large amount of text for specific kinds of information. Classifiers are old-school, but they're pretty cool and surprisingly useful nonetheless.

Here's how to run it (a quickstart).

pip install -r requirements.txt

Change super_config.yaml to be:

pipeline_order:
  - folder: "classifier_creator"
    config: "config.yaml"

Then, download the IMDb dataset from Hugging Face:

Put it in the "input" folder pointed to by the classifier_creator/config.yaml file.

Add your API key and your favorite open-source AI API provider to that same file.

Then run: python run_augmentoolkit.py

Prompts for this new pipeline can be found in prompts_classifier.

NOTE that the classifier creator can also take .json, .jsonl, and .parquet files as input, if they have a "text" column! This lets you use off-the-shelf datasets from Hugging Face, such as Enron emails or FineWeb!
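If your text isn't already in one of those formats, wrapping it is trivial. A minimal sketch (the helper name is mine; the only real requirement, per the note above, is a "text" column or field):

```python
import json

def write_text_jsonl(texts, path):
    """Illustrative: wrap a list of strings in the {"text": ...} shape
    so arbitrary text can be fed to the classifier creator as .jsonl."""
    with open(path, "w") as f:
        for t in texts:
            f.write(json.dumps({"text": t}) + "\n")
```

Point the pipeline's input folder at the resulting file and run as usual.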

Key features at a glance:

Don't hesitate to reach out if you have any questions about the new pipeline or Augmentoolkit! My contacts are at the bottom of this readme.

Classifier Config Step-by-Step

Most of the config settings are the same as Augmentoolkit's QA pipeline, but here are the points of difference:

Classifier Visual Explanation of Steps

Classifier Quirks and Tips


Customization

I said before that Augmentoolkit was (re)built to be extensible, customizable, and modular. I was not kidding! While some other parts of this README have covered config settings and the overall 'shape' of the project, this part is dedicated to some information that should help you if/when you decide to build your own pipelines, or make contributions to the codebase.

TLDR key points: PipelineStep() is what you should use for most LLM calls, and, by convention in Augmentoolkit, we pass information through a pipeline as a list of dicts and use the keys of each dict to format values into LLM prompts.

Abstractions

Let's first talk about the main abstractions that you'll see throughout Augmentoolkit. There are not too many of them, but they are useful, and you need to know how they work if you're going to work in this codebase.

From processing to the engine wrapper: how inputs travel

It's useful to know how inputs are passed along the code of Augmentoolkit, from start to finish, so that you can understand what the inputs to any of the given intermediate functions are.

So here's a description. It's pretty long and recurses through much of the process, even getting decently low-level. It's only really recommended reading if you're going to build your own pipeline on top of Augmentoolkit. Also, in case my explanations are unclear, the location of each important class is given so you can look at the code itself.

At the start of a pipeline, text is usually read from its input files as a string, and then broken into a list of dicts resembling {"paragraph": "chunk contents would go here", "metadata": "the filename that the chunk belonged to originally"} by some chunking algorithm. For the rest of the pipeline, the main store of information will be a list of dicts.
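A minimal chunker in that shape might look like the following. This is a sketch only; Augmentoolkit's real chunking logic is more involved than fixed-size slicing:

```python
def chunk_file(text: str, filename: str, chunk_size: int = 1900):
    """Illustrative chunker: split one file's text into fixed-size
    pieces in the {"paragraph": ..., "metadata": ...} dict shape that
    the rest of a pipeline passes around."""
    return [
        {"paragraph": text[i : i + chunk_size], "metadata": filename}
        for i in range(0, len(text), chunk_size)
    ]
```

The resulting list of dicts is what gets mapped over by the pipeline's steps.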

Typically the list of dicts is updated over the course of a pipeline by mapping an LLM-calling function over it asynchronously. The function will be passed a dict from the list,

tasks = [
        steps.generate_qadicts_from_para(
            idx,
            para,
            engine_wrapper_large=engine_wrapper_large,
            generated_qa_dicts=generated_qa_dicts,
        )
        for idx, para in enumerate(filtered_worthy_for_questions)
    ]

and in turn will use its information as input to an LLM.

question_generation_step = QuestionGenerationStep() # this is an instance of PipelineStep which we will get to soon.

# Question generation
async def generate_qadicts_from_para(
    idx,
    para,
    engine_wrapper_large=None,
    generated_qa_dicts=None,
):
    # NOTE Set up qatuple generation step #

    await question_generation_step.run(
        idx=idx,
        input_data=para,
        engine_wrapper=engine_wrapper_large,
        output_list=generated_qa_dicts
    )

Once it gets back a response, the function will create a new dict with a new key-value pair (containing the response, or a processed version of it) and will append the new object to an output list.

So if we start with

{"paragraph": "chunk contents would go here", "metadata": "the filename that the chunk belonged to originally"}

after a step finishes, we might have each object in the OUTPUT list being something like:

{"paragraph": "chunk contents would go here", "metadata": "the filename that the chunk belonged to originally", "foo": "bar"}

Typically, after a step is done, the output list is used as the input list for whatever step comes next.

To go a bit deeper, you saw how the generate_qadicts_from_para() function basically just passed its inputs to a method of a certain QuestionGenerationStep? That's a subclass of PipelineStep. .run() is a method of PipelineStep. It passes the input dict down to a GenerationStep, which passes it onto the EngineWrapper, which actually sends the request and gets the response. We'll go over the role of each of these classes now.

Pipeline Step

Location: augmentoolkit/generation_functions/pipeline_step_class.py

The pipeline step handles the nuts and bolts of executing a single step of a pipeline.

This class also stores all the settings a given step of the pipeline could possibly need. If, fundamentally, the units of an LLM call are the prompt, the LLM, and the sampling parameters, then the PipelineStep stores the sampling parameters and the path to the prompt, while one of the arguments to .run is the engine_wrapper, i.e., the model.

You will likely not have to change the PipelineStep file itself, but to achieve specific functionality you will sometimes have to override it. See how RPToolkit does depth-first generation by making a subclass, and how the original pipeline creates many subclasses that override specific methods to get certain behavior. The PipelineStep can usually be used as-is, but object orientation is leaned on heavily to reduce clunky boilerplate while still allowing as much flexibility as possible in pipeline design.

Generation Step

Location: augmentoolkit/generation_functions/generation_step_class.py

The Generation Step handles putting together the requests that are sent into the engine wrapper (an engine wrapper is always passed to a generation step as one of its initialization arguments). This includes formatting stuff into the prompt. That is important, so let's talk about it.

You know how input lists in Augmentoolkit, which pipeline steps' .run() methods are mapped over, are basically a list of dicts?

{"paragraph": "chunk contents would go here", "metadata": "the filename that the chunk belonged to originally"}

The keys of these dicts are really important, because a prompt file might look like this (highly simplified):

- role: user
  content: |
    Text: """{paragraph}"""

    Filename: {metadata}
    --------
    Classify whether this text is a table of contents or not

Specifically, the keys of input objects are used to interpolate values into that step's prompt. The GenerationStep class handles this automatically: if you put together the above prompt and dict, you send the AI server something like:

- role: user
  content: |
    Text: """chunk contents would go here"""

    Filename: the filename that the chunk belonged to originally
    --------
    Classify whether this text is a table of contents or not

This is how prompt formatting is done in Augmentoolkit: it is based on the names of the keys in an input data object. Those names must line up with what is in the prompts. The GenerationStep handles this formatting and a bit more. If you want to truly understand how it works you will have to look at the code -- the objective of this section of the README is not to exhaustively explain what every line does, but to give a high-level understanding that will help you read the code faster and grasp it more easily.
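Under the hood, the interpolation amounts to keyed string formatting, something like the following sketch (not the actual GenerationStep code, which also handles message roles, multi-turn prompts, and more):

```python
# A simplified, single-message version of a prompt file's content.
prompt_template = (
    'Text: """{paragraph}"""\n'
    "\n"
    "Filename: {metadata}\n"
    "--------\n"
    "Classify whether this text is a table of contents or not"
)

input_obj = {
    "paragraph": "chunk contents would go here",
    "metadata": "the filename that the chunk belonged to originally",
}

# The dict's keys must match the {placeholders} in the prompt file.
formatted = prompt_template.format(**input_obj)
```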

You probably won't change this file that much, but basically any LLM call will rely on it. It's important to know how prompts are formatted here. Furthermore, some slightly older parts of certain pipelines (such as Augmentoolkit's question validation) still use GenerationSteps without pipeline steps, due to the really unconventional control flow of those sections. So there's a chance you'll need to use this class yourself after all.

Anyway.

Once a prompt is formatted, it is sent off to the EngineWrapper.

Engine Wrapper

Location: augmentoolkit/generation_functions/engine_wrapper_class.py

The Engine Wrapper is a single class that allows you to call all sorts of different APIs, with all sorts of different settings. It simplifies async calls, and uses streaming to avoid timeouts on long generation tasks.

An engine wrapper is instantiated with a model, API key, base URL, and mode. This object is usually then passed around a pipeline — after being instantiated in processing.py, an EngineWrapper object will typically be passed into the .run() method of pipeline steps, which pass it into GenerationSteps, which then call the wrapper's .submit_chat() or .submit_completion() methods. Engine wrappers don't store any of the sampling parameters (e.g., temperature) of an API call; just the destination, the kind of API, and what model is being used.
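As a mental model, the wrapper boils down to something like this hypothetical sketch (the real class lives in engine_wrapper_class.py; names and signatures here are illustrative, not the actual API):

```python
class ToyEngineWrapper:
    def __init__(self, model, api_key, base_url, mode):
        # Only the destination, API kind, and model are stored --
        # sampling parameters arrive with each call.
        self.model = model
        self.api_key = api_key
        self.base_url = base_url
        self.mode = mode

    def submit_chat(self, messages, sampling_params):
        # Each mode branch adapts `messages` to that API's request shape.
        if self.mode == "openai":
            return {
                "url": self.base_url,
                "model": self.model,
                "messages": messages,
                **sampling_params,
            }
        raise ValueError(f"unsupported mode: {self.mode}")

wrapper = ToyEngineWrapper("some-model", "sk-...", "https://example.com/v1", "openai")
request = wrapper.submit_chat([{"role": "user", "content": "hi"}], {"temperature": 0.7})
```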

If you want to add a new API (e.g., Anthropic) you would only have to change this file. Supporting different modes is simply a matter of an if-statement; you can see how it's done with Cohere right now:

elif self.mode == "cohere":
    timed_out = False
    completion = ""
    messages_cohereified = [
        {
            "role": "USER" if message["role"] == "user" else "CHATBOT",
            "message": message["content"],
        }
        for message in messages
    ]
    # ...etc...

You will likely see, and use, EngineWrappers in every pipeline you build. They are essentially part of the boilerplate that pipelines start off with — "read the config, chunk the text, and define your engine wrappers, one for each model" is the generic process at the start of each pipeline.

Creating a New Pipeline

Now that we've talked about some of the code, let's talk about something a bit lighter: what to name stuff and where to put it, when making your own Augmentoolkit-style dataset generation pipeline.

If you are more of a doer than a reader, you can go over to ./BOILERPLATE_TO_MAKE_YOUR_OWN_PIPELINE: there's a project skeleton there that runs, and it serves as a minimalistic example to play with and build your own dataset generation pipelines from. It already follows all the conventions in this section.

Naming conventions and folder structure

Every pipeline needs a processing.py, a steps.py, an __init__.py, and at least one .yaml file with config in its name. It will also, almost certainly, need some kind of prompts folder.

processing.py, steps.py, and __init__.py need to be top level in the project folder. The config does not have to be.

But what do each of these files do? What's the logic behind the organization?

processing.py is meant to be where you put the control flow. It's the main entry point of the pipeline: when Augmentoolkit runs a pipeline, it runs processing.py.

steps.py is where you put helper functions, as well as generation functions (i.e., functions that make LLM calls) to be imported by processing.py.

And you know about the config already, that's where you put settings.

__init__.py is just needed by Python for imports and can be empty.
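Put together, a minimal pipeline folder might look like this (file names beyond the required ones are illustrative):

```
my_pipeline/
├── processing.py   # control flow; the entry point Augmentoolkit runs
├── steps.py        # helper and generation functions imported by processing.py
├── __init__.py     # empty; needed by Python for imports
├── config.yaml     # settings (does not have to be top level)
└── prompts/        # prompt .yaml files
```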

Code must-dos

This README has already covered most of the heavy stuff around code in Augmentoolkit. This very brief section exists to cover a handful of "gotchas" and footguns.

  1. For fields in your config that are not strings, convert the datatypes after loading them:

    from augmentoolkit.utils.parse_bool import parse_bool
    # ...
    CONCURRENCY_LIMIT = int(obj_conf["SYSTEM"]["CONCURRENCY_LIMIT"])
    USE_STOP = parse_bool(obj_conf["SYSTEM"]["STOP"])
    USE_MIN_P = parse_bool(obj_conf["SYSTEM"]["USE_MIN_P"])
    # from: BOILERPLATE_TO_MAKE_YOUR_OWN_PIPELINE/steps.py

    This is because of the relative newness of the GUI, which does not respect datatypes and currently saves everything as strings. I am not a streamlit expert, so until we get a PR that respects the datatypes of fields in config.yaml files, we need to convert stuff like this.

  2. You should convert paths that you read in from the config into absolute paths within your Python files.

# from: BOILERPLATE_TO_MAKE_YOUR_OWN_PIPELINE/steps.py
OUTPUT = os.path.abspath(obj_conf["PATH"]["OUTPUT"])
DEFAULT_PROMPTS = os.path.abspath(obj_conf["PATH"]["DEFAULT_PROMPTS"])
PROMPTS = os.path.abspath(obj_conf["PATH"]["PROMPTS"])

I don't quite recall why I started doing this, but I remember vague problems when I did not (relative paths resolve against the current working directory, which changes depending on where the project is launched from). So, to avoid vague problems, you should also start doing this.

  3. Extract the path to the config that the project is going to use like so:
    config_path = os.environ["CONFIG_PATH"]
    with open(config_path, "r") as file:
        obj_conf = yaml.safe_load(file)

    run_augmentoolkit.py uses environment variables to communicate to each pipeline's processing.py what config it wants it to use.
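For reference, the parse_bool helper mentioned in the first gotcha can be imagined as something like this (a hypothetical sketch; the real one is in augmentoolkit/utils/parse_bool.py and may differ):

```python
def parse_bool(value):
    # Config values arrive as strings from the GUI, so normalize them.
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "1", "yes")
```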

There's a risk I've missed something in this list of gotchas: if you stumble into a strange and arcane problem while building a pipeline that is my fault, please create an issue so I can fix it!

Config Structure

You can pretty much do anything you want with config structure, just don't nest things more than one level deep. By that I mean:

KEY:
  ANOTHER_KEY: 1

^ is fine but

KEY:
  ANOTHER_KEY:
    WHOA: 1

is bad

If you make a new pipeline

You should open source it! If you've made something cool I'd be honored to add your new pipeline to the Augmentoolkit project with you as a contributor, so that we can continue to make dataset generation more open for all.

Training a model

Augmentoolkit comes with a few prebuilt Axolotl configs that you can use to train a custom model on the data that you get from its pipelines. However, you are encouraged to tune the hyperparameters and other settings to your specific use case.

There's also a video showing you how to do it: https://youtu.be/dby8y4hkJQU

IMPORTANT NOTE: if you're creating your Runpod account for the first time while following the above video, I would appreciate it if you used this Runpod referral link https://runpod.io?ref=tjhovswf to support Augmentoolkit's creation and open-sourcing of additional datasets.

Roadmap

In the coming weeks and months, Augmentoolkit will be expanded with additional pipelines, capabilities, and updates. I'm working in collaboration with AlignmentLab AI for some of this!

One specific pipeline coming up is ultra-long context instruct data. Let me know if there are other kinds of pipelines you'd like to see, and I'll add them too!

Also thinking about maybe an annotation pipeline...

And, of course, anything awesome that you invent I'd be happy to have here as well. Collaboration is a great part of open source!

Community

Augmentoolkit has a vision of democratizing dataset generation. That's a pretty community-oriented thing, so it only makes sense for us to have a community hub! Come join the Augmentoolkit Discord server to chat with fellow AI people, get support, and share the awesome stuff you're making.

Also, you can find all the Augmentoolkit help videos — and soon, additional fun and informative AI things related to datagen and the project — on this YouTube channel.

Donation

If you want to donate to the development of Augmentoolkit and continued open-sourcing of models using this tech, you can do so with this ko-fi donation link. It's greatly appreciated! For sponsorship inquiries related to the Augmentoolkit project, please reach out via socials, Discord, or email (contact info at bottom of repo).


For Businesses

I work with AI startups and companies that want to create (or improve) specialized LLMs using lots of quality training data. Do you need a great dataset for your business's AI? Or do you want to apply AI models that you own to a profitable niche that generalist ones are struggling with? I'd be happy to help you painlessly create the custom dataset (and custom data pipeline) you need, as well as the documentation to expand on these tools. Given that I made the original version of this thing, I'm probably the best person in the world for this task. You can schedule a quick call to talk about your needs with me using this Calendly link: https://calendly.com/evanpeterarmstrong/discovery-call. I'm not just looking for some side gig; I do this for a living.

Note: the base version of Augmentoolkit is fully open-sourced and MIT-licensed. The consulting option is for people who want a bespoke modification (or even a whole new custom pipeline) and guaranteed quality results, fast (it took 13 months of learning and iteration for me to make Augmentoolkit work like it does now). A collaboration would be zero-risk: you have a money-back guarantee.


Think this is cool? Connect with me elsewhere!

If you think this project is cool and useful, great! I'm genuinely happy that you're interested in my work. If you're really interested in this project, you might also be interested in some of my other endeavors:

Contributing

Contributions are appreciated! Whether it's a new API endpoint, or a set of prompts you've found to work really well, or an entirely new pipeline, please submit a PR! Reviews are fast here. Anything that can further the goal of democratized dataset generation is welcome.