e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets
MIT License
584 stars 79 forks source link

Augmentoolkit — infinite domain-specific instruct data

Turn any raw text into a high-quality dataset using local models. Make data gathering a painless step of the model creation process. Augmentoolkit is the easy-to-use, customizable, open-source, and cost-effective data generation solution. No OpenAI needed.

Augmentoolkit is an AI-powered tool that lets you create domain-specific data to finetune LLMs, using open-source AI.

Cite: DOI

Benefits

Augmentoolkit makes LLM data easy.

We've also done our best to facilitate the step after you generate your data -- training your LLM:

Finally, using the model you create should be easy and valuable:

Demo video & Video Tutorials:

3-Minute Demo Video Here

Note that Video Documentation is currently built for Augmentoolkit, a sister project of Augmentoolkit built for the Verus community. The process of running it should be the same, and the process of training the LLM is definitely the same. But when it mentions "Augmentoolkit" and the Verus project, that is why.

Augmentoolkit-specific video docs are in the works.

Video Documentation 1 — Dataset generation

Video Documentation 2 — Model Training, Quantizing, and Chatting

Table of Contents:

  1. Quickstart
  2. Self Promotion (read if you're a business!)
  3. Vision (Introduction)
  4. Usage
  5. What to do with what you get out
  6. Roadmap
  7. Community
  8. Latest Update Info
  9. Think this is cool? Connect with me elsewhere!
  10. Contributing
  11. Join A Discord for Dataset Generation!
  12. Using "Aphrodite mode" (deprecated)

Quickstart

Terminal

After installing the dependencies:

Web UI

  1. Install the dependencies (pip install -r requirements.txt)
  2. Find the absolute path to the raw_txt_input folder
  3. Run export GRADIO_TEMP_DIR=<raw_txt_input_absolute_path>
  4. Run python app.py

webui.jpg

For Businesses

I work with AI startups that want to create (or improve) specialized LLMs using lots of quality training data. Do you need a dataset for your business's AI? I can modify Augmentoolkit for any domain and for tasks beyond question answering, and I'd be happy to help you painlessly create the data — and data-creation tools — you require. Given that I made the original version of the darn thing, I'm probably the best person in the world for this task. You can schedule a quick call to talk about your needs with me using this Calendly link: https://calendly.com/evanpeterarmstrong/discovery-call.

Note The base version Augmentoolkit is fully open sourced and MIT-licensed. The consulting option is for people who want a bespoke modification and quality results, fast (it took 5 months of learning and iteration for me to master open source model pipelines enough to make Augmentoolkit work well). If you're a hobbyist and have time to experiment with its base version for casual or personal uses, by all means go for it!

Vision

Dataset creation is currently the most painful, and most important, step of the finetune-creation process. Most people have to resort to either A) burning an obscene number of OpenAI API credits, or B) spending dozens, if not hundreds, of hours accumulating a hybrid dataset based off of your own conversations with bots. The OpenAI approach is based on a paid service (whose TOS you're violating) that can ban you at any second, whose writing style you probably hate, which is getting worse every month, and whose synthetic data critically lacks variety. Handwriting the examples is far too slow to iterate on, and does not scale at all, meaning you're missing out on huge potential performance increases that come with more data. If you're a company and you pay people to create examples in bulk, then it's possibly pricier than even OpenAI — also not scalable at all. And moreover, if we're literally creating machines that can write, why do we spend most of our time writing?

Augmentoolkit is meant to make high-quality data generation easy, fast, shareable, configurable, and for everyone. It is meant to allow the easy creation of datasets about any knowledge that exists in plain text. It is meant to allow models to bootstrap additional training data for themselves. It is meant to allow any enthusiast, regardless of computer strength, to contribute to the advancement of AI by generating swathes of data for cheap. It's meant to expand the possibilities of what finetunes can be built, by making data gathering as easy as running a script. Whether you're finetuning a company chatbot to understand your business's information, are creating an AI ambassador for your community that can explain your mission and goals, or are doing something else entirely, Augmentoolkit exists to make your data problems a bit less problematic.

A flowchart of Augmentoolkit's operation can be found in the Usage section.

The high-level is: books or manuals in, information-rich conversations out. Train the model on the conversations, it learns the information. Extensive validation keeps hallucinations to a minimum.

More in-depth and jargon-filled: Augmentoolkit takes human-written text with information in it, and turns it into instruct-tuning data. Basically, it uses LLMs to convert pretraining data into conversational multi-turn QA data:

Usage

Installation

First, get the repository onto your computer:

git clone https://github.com/e-p-armstrong/augmentool.git

Then, install the project's dependencies.

pip install -r requirements.txt

You may get some messages saying that torchvision and torchaudio require older versions of Pytorch. This should be safely ignorable.

If you want to use Aphrodite inside the code, you'll also need to add

pip install aphrodite-engine

However, it is recommended to just run whatever local inference engine you're using in another window, and put its API endpoint in config.yaml, rather than using the built-in aphrodite mode. That mode can be considered deprecated at this point.

NOTE it is likely more cost-effective for large scale dataset generation to rent GPUs for a couple bucks/hr on a service like Vast.ai or Runpod, than it is to use APIs like Together.ai. However, APIs are faster and require little setup. So the currently advised process is: experiment with APIs, and generate for production with rented compute.

I will make a video tutorial on local dataset generation with Augmentoolkit sometime soon.

Augmentoolkit resumes previously-started runs if the output folder is not empty. Rename or move it elsewhere if you are not trying to continue interrupted dataset generation.

Config.yaml, Step-by-Step

You can easily customize how a given run of Augmentoolkit proceeds by modifying config.yaml. The WebUI also has the ability to customize settings. Let's walk through each field in the YAML file so that you can understand how to change it to suit your needs:

First up, we have the API section:

API:
  API_KEY: your key here
  BASE_URL: https://api.together.xyz
  LARGE_LOGICAL_MODEL: meta-llama/Llama-3-70b-chat-hf
  LOGICAL_MODEL: meta-llama/Llama-3-70b-chat-hf
  QUANTIZATION_SMALL: "gptq"
  QUANTIZATION_LARGE: "gptq"

Field-by-field:

Next up, we have the PATH section:

PATH:
  INPUT: "./raw_text_input_vision_paper"
  OUTPUT: "./output"
  DEFAULT_PROMPTS: "./prompts"
  PROMPTS: ./prompts_vision_paper

Field-by-field:

Next, we have the SYSTEM section:

SYSTEM:
  CHUNK_SIZE: 1900
  USE_FILENAMES: False
  COMPLETION_MODE: false
  CONCURRENCY_LIMIT: 60
  DOUBLE_CHECK_COUNTER: 1
  FINAL_ASSISTANT_PROMPT_NO_RAG: |
   You are a helpful, friendly AI assistant.
  FINAL_ASSISTANT_PROMPT_RAG: |
   You are a helpful, friendly AI assistant.

   Context information is below:

   ----------------------
   {data}
  MODE: api
  STOP: true
  SUBSET_SIZE: 10
  USE_SUBSET: true

Field-by-field:

Note:

SKIP:
  QUESTION_CHECK: False
  ANSWER_RELEVANCY_CHECK: True # turn on if using the negative question prompt override

This lets you control whether you want to skip certain steps of the pipeline. QUESTION_CHECK should generally not be skipped under any circumstances, but ANSWER_RELEVANCY_CHECK may be skipped if you are using the "negative" prompt overrides, by default located in ./prompts_override_negative_question. So, turn any one of these on if you want the corresponding step to simply be skipped. These options allow a degree of control flow control, without touching code.

Important Files (If you're modifying the code)

Starting from more common things to less common things:

Visual Explanation of Steps

Here is a flowchart detailing how a typical run of Augmentoolkit may proceed. The source text can be anything with information you can ask questions about.

What to do with what you get out

The important files to look out for in your OUTPUT folder are simplified_data_no_rag.jsonl, simplified_data_rag.jsonl, and pretraining.json. These are what you will most commonly use for training. The other top-level files are there incase you want more information, such as the chunk and name of the file that each conversation was generated from. But for training, you will want simplified_data_no_rag.jsonl, simplified_data_rag.jsonl, and pretraining.json. All are already formatted for use with the Axolotl open-source training library. All you need to do is use these datasets like how the provided configs in _model_training_configs/ are used.

The format of the conversational files is called "shareGPT", and is a common format across many datasets. pretraining.json however is formatted as pretraining data. To bake factual information into an LLM, it is recommended you use a full finetune or (cheaper) GaLore tuning, combined with continued pretraining on the source text + the instruct data that Augmentoolkit generates. If you want a more in-depth example, check out the provided configs, or the second video of the [Video Documentation]().

Roadmap

In the coming weeks and months, I plan to start using Augmentoolkit to produce open-source models in popular, specific domains. Difficulties encountered will fuel further development of this project. Let me know what kinds of LLMs you want built!

Community

Changing Augmentoolkit usually requires a lot of prompting of open-source LLMs. This, and many other aspects of developing with AI, are tricky to get a handle on. I'm making a community to make this easy for you.

This community isn't the typical fake guru "AI tools" hype BS. It's by and for developers.

Currently we have a full video course on open-source prompting, weekly Q&A calls about developing with AI, and optional monthly 1-on-1s with myself, where I help you with your project in any way I can. There's a tight-knit group of professionals and hobbyists like you.

I'd love to have you there. Follow this link to see the landing page and decide if it's something you might be interested in joining.

Latest Update Info

Augmentoolkit has received a major update as of Jun 12, 2024. Two whole new prompt override folders let you generate data designed to make your model smarter and more detailed, while also getting more instruct data for your pretraining buck. Tweaks have been made based on hard-earned experience to optimize for final model quality. Prompt overhauls make the pipeline more flexible and powerful. YAML now is used instead of JSON for prompts because of newline support. An early version of an entirely new datagen program, built to create datasets even without any text input, can be found in ./pure_synthetic_pipeline. There's better logging for common errors. Oh, and the code has been almost entirely refactored!

In the most recent update, RP mode was removed. You can revert to previous versions if you need that. The code is now much cleaner and more maintainable as a result of this.

Many of these changes are inspired by the recently-released Augmentoolkit which I developed for (and as part of) the Verus community. Augmentoolkit is specific for Verus, but it's open-sourced, and so I ported its key improvements back here. Go check it out if you're interested the future of blockchain technology!

Think this is cool? Connect with me elsewhere!

If you think this project is cool and useful, great! I'm genuinely happy that you're interested by my work. If you're really interested by this project you might be interested by some of my other endeavors:

Contributing

Contributions are appreciated! Whether it's a new API endpoint, or a set of prompts you've found to work really well, please submit a PR! Reviews are fast here. Anything that can further the goal of democratized dataset generation is welcome.

Join A Discord for Dataset Generation!

MrDragonFox -- one of the moderators of the Mistral and TheBloke Discords -- has a server where he's working on a new quantization engine. There's a corner to discuss Augmentoolkit there! Come check it out and connect at https://discord.com/invite/foxengine-ai!

Using "Aphrodite mode" (deprecated)

NOTE: Aphrodite mode is pretty much deprecated. If you are running on your own or rented hardware, unless you are doing a truly massive amount of data generation, it is advisable to just use the best model you can for all steps. This means you can just run processing.py as normal.

I'm considering removing aphrodite mode and just letting local inference be done via local OAI-compatible servers, but incase people want it specifically I'm leaving it for now. I hope that this set of instructions is complete enough to follow.

Q: Why do I have to run this in two phases? A: Because it's difficult to switch between two different models when doing local inference.