Data Prep Kit is a community project to democratize and accelerate unstructured data preparation for LLM app developers. With the explosive growth of LLM-enabled use cases, developers face the enormous challenge of preparing use-case-specific unstructured data to fine-tune or instruction-tune LLMs, or to build RAG applications. As the variety of use cases grows, so does the need for data preparation support across them.
Data Prep Kit offers implementations of commonly needed data preparation steps, called modules or transforms, for both Code and Language modalities, with a vision to extend to images, speech, and multimodal data. The goal is to offer high-level APIs for developers to quickly get started working with their data, without needing expertise in the underlying runtimes and frameworks.
Data Prep Kit is a toolkit that streamlines data preparation for developers looking to build LLM-enabled applications via fine-tuning, RAG, or instruction-tuning. Data Prep Kit contributes a set of modules that developers can use to easily build data pipelines suited to their use cases. These modules have been tested while producing pre-training datasets for the Granite open source LLM models.
The modules are built on a common framework for Spark and Ray, called the data processing library, which allows developers to build new custom modules that readily scale across a variety of runtimes.
Features of the toolkit:
Data modalities supported today: Code and Natural Language.
With no setup necessary, let's use a Google Colab friendly notebook to try Data Prep Kit. This is a simple transform to extract content from PDF files: examples/notebooks/Run_your_first_transform_colab.ipynb. (Here are some tips for running Data Prep Kit transforms on Google Colab. For this simple example, these tips are either already taken care of or not needed.)
To run on a local machine, follow these steps to quickly set up and deploy the Data Prep Kit in your virtual Python environment.
conda create -n data-prep-kit -y python=3.11
conda activate data-prep-kit
python --version
Check that the Python version is 3.11.
If you are using a Linux system, install gcc and g++ using the commands below:
conda install gcc_linux-64
conda install gxx_linux-64
Next, install the data prep toolkit library. This installs both the Python and Ray versions of the transforms. For better management of dependencies, it is recommended to install the same tagged version of both the library and the transforms.
pip3 install 'data-prep-toolkit[ray]==0.2.2.dev1'
pip3 install 'data-prep-toolkit-transforms[ray,all]==0.2.2.dev1'
pip3 install jupyterlab ipykernel ipywidgets
## install custom kernel
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
Test your installation. If you can import the following data-prep-kit libraries successfully in Python, your installation has succeeded.
## start python interpreter
$ python
# import DPK libraries
>>> from data_processing_ray.runtime.ray import RayTransformLauncher
>>> from data_processing.runtime.pure_python import PythonTransformLauncher
If there are no errors, you are good to go!
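Beyond the import check, the general launcher pattern looks roughly like the sketch below: construct a launcher with a transform's configuration class and point it at input and output folders. The pdf2parquet import path and class name here are assumptions and may differ between releases; check the transform's README for the exact names in your installed version.

```python
# Minimal sketch of launching a single transform locally.
# The pdf2parquet module path and configuration class name below are assumptions;
# substitute the names documented for your installed release.
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration  # assumed import path

# Point the launcher at local input/output folders (illustrative paths).
local_conf = {"input_folder": "input-pdfs", "output_folder": "output-parquet"}
sys.argv = ParamsUtils.dict_to_req(d={"data_local_config": ParamsUtils.convert_to_ast(local_conf)})

launcher = PythonTransformLauncher(runtime_config=Pdf2ParquetPythonTransformConfiguration())
launcher.launch()
```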
Let's try the same simple transform to extract content from PDF files on a local machine.
Local Notebook versions
You can try either one or both of the following two versions (a pure-Python version and a Ray version):
To run the notebooks, launch jupyter from the same virtual environment you created using the command below.
jupyter lab
After opening the Jupyter notebook, change the kernel to dataprepkit so that all libraries are properly loaded.
Explore more examples here.
Now that you have run a single transform, the next step is to explore how to put these transforms together to run a data prep pipeline for an end-to-end use case like fine-tuning a model or building a RAG application. This notebook gives an example of how to build an end-to-end data prep pipeline for fine-tuning code LLMs. Similarly, this notebook shows an end-to-end sample data pipeline for fine-tuning on language datasets. You can also explore how to build a RAG pipeline here.
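Conceptually, chaining transforms amounts to handing one transform's output folder to the next transform as its input folder. The sketch below illustrates that idea; the configuration classes named in the comments are placeholders for whichever modules your pipeline actually needs.

```python
# Sketch of chaining transforms by folder hand-off between steps.
import sys

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils


def run_step(runtime_config, input_folder: str, output_folder: str) -> None:
    """Run one transform step over a local input folder, writing to an output folder."""
    local_conf = {"input_folder": input_folder, "output_folder": output_folder}
    sys.argv = ParamsUtils.dict_to_req(d={"data_local_config": ParamsUtils.convert_to_ast(local_conf)})
    PythonTransformLauncher(runtime_config=runtime_config).launch()


# Example wiring (placeholder configuration classes; substitute real transform modules):
# run_step(Pdf2ParquetPythonTransformConfiguration(), "pdfs", "stage1-parquet")
# run_step(DocChunkPythonTransformConfiguration(), "stage1-parquet", "stage2-chunks")
```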
The matrix below shows the combinations of modules and supported runtimes. All the modules can be accessed here and can be combined to form data processing pipelines, as shown in the examples folder.
Modules | Python-only | Ray | Spark | KFP on Ray |
---|---|---|---|---|
Data Ingestion | ||||
Code (from zip) to Parquet | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
PDF to Parquet | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
HTML to Parquet | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Web to Parquet | :white_check_mark: | |||
Universal (Code & Language) | ||||
Exact dedup filter | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Fuzzy dedup filter | :white_check_mark: | :white_check_mark: | ||
Unique ID annotation | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
Filter on annotations | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
Profiler | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
Resize | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
HAP (hate, abuse, profanity) annotation | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Tokenizer | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Language-only | ||||
Language identification | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Document quality | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Document chunking for RAG | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Text encoder | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
PII Annotator/Redactor | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Code-only | ||||
Programming language annotation | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Code quality annotation | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Malware annotation | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Header cleanser | :white_check_mark: | :white_check_mark: | :white_check_mark: | |
Semantic file ordering | :white_check_mark: | |||
License Select Annotation | :white_check_mark: | :white_check_mark: | :white_check_mark: |
Contributors are welcome to add new modules to expand to other data modalities as well as add runtime support for existing modules!
At the core of the framework is a data processing library that provides a systematic way to implement the data processing modules. The library is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. We use the popular parquet format to store the data (code or language). Every parquet file follows a set schema. A user can apply one or more transforms (or modules), as discussed above, to process their data. A transform can follow one of two patterns: annotator or filter.
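Because every module reads and writes parquet, a quick way to see what a transform produced is to inspect an output file with pyarrow. The file path below is illustrative.

```python
# Inspect the schema and size of a parquet file produced by a transform.
import pyarrow.parquet as pq

table = pq.read_table("output-parquet/sample.parquet")  # illustrative path
print(table.schema)     # column names and types, including any annotation columns
print(table.num_rows)   # number of rows (documents) in this file
```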
Annotator: An annotator transform adds information during processing by adding one or more columns to the parquet files. The annotator design also allows a user to verify the results of the processing before the actual filtering of the data.
Filter: A filter transform processes the data and outputs the transformed data, e.g., exact deduplication. A general-purpose SQL-based filter transform enables a powerful mechanism for identifying columns and rows of interest for downstream processing.
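To make the two patterns concrete, here is a small pyarrow sketch; the column names are illustrative and not part of the toolkit's schema. The annotator step adds a column, and the filter step drops rows based on that annotation.

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"document": ["short doc", "a much longer document with more content"]})

# Annotator pattern: add a new column with derived information (here, a length annotation).
annotated = table.append_column("doc_length", pc.utf8_length(table["document"]))

# Filter pattern: keep only rows whose annotation passes a threshold.
filtered = annotated.filter(pc.greater(annotated["doc_length"], 15))
```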
For a new module to be added, a user can pick the right design based on the processing to be applied. More details here.
One can leverage Python-based processing logic and the Data Processing Library to easily build and contribute new transforms. We have provided an example transform that can serve as a template for adding new simple transforms. Follow the step-by-step tutorial to add your own new transform.
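As a rough sketch of the shape of a simple annotator transform, the example below assumes the library's AbstractTableTransform base class and its tables-in, tables-plus-metadata-out contract; the tutorial covers the full configuration and launcher wiring. The "contents" column name and the config key are illustrative, not part of the toolkit's schema.

```python
# Sketch of a minimal annotator transform; the base class, method signature,
# column name, and config key are assumptions to be checked against the tutorial.
from typing import Any

import pyarrow as pa
import pyarrow.compute as pc

from data_processing.transform import AbstractTableTransform


class DocLengthTransform(AbstractTableTransform):
    """Annotate each row with the character length of its 'contents' column."""

    def __init__(self, config: dict[str, Any]):
        super().__init__(config)
        self.column = config.get("doc_length_column", "doc_length")  # illustrative config key

    def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
        lengths = pc.utf8_length(table["contents"])  # 'contents' is an illustrative text column
        out = table.append_column(self.column, lengths)
        return [out], {"rows_annotated": out.num_rows}
```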
For a deeper understanding of the library's architecture, its transforms, and available runtimes, we encourage the reader to consult the comprehensive overview document alongside dedicated sections on transforms and runtimes.
Additionally, check out our video tutorial for a visual, example-driven guide on adding custom modules.
Data-prep-kit provides the flexibility to transition your projects from the proof-of-concept (PoC) stage to full-scale production mode, offering all the necessary tools to run your data transformations at high volume. In this section, we show you how to run your transforms at scale and how to automate them.
To enable processing of large data volumes on multi-node clusters, Ray and Spark wrappers are provided to readily scale out the Python implementations.
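In practice this usually means handing the same transform to the Ray launcher instead of the Python one; a rough sketch follows. The parameter names for local Ray execution and worker count are assumptions and may differ by release.

```python
# Sketch of scaling out with the Ray launcher; "run_locally" and
# "runtime_num_workers" are assumed parameter names, to be checked against
# the runtime documentation of your release.
import sys

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher

local_conf = {"input_folder": "input-parquet", "output_folder": "output-parquet"}
params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    "run_locally": True,          # assumed: start a local Ray cluster
    "runtime_num_workers": 4,     # assumed: number of parallel Ray workers
}
sys.argv = ParamsUtils.dict_to_req(d=params)

# runtime_config would be the Ray configuration class of the chosen transform
# (its *RayTransformConfiguration counterpart; placeholder below).
# launcher = RayTransformLauncher(runtime_config=MyTransformRayConfiguration())
# launcher.launch()
```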
A generalized workflow is shown here.
The toolkit also supports transform execution automation based on Kubeflow Pipelines (KFP), tested on a locally deployed Kind cluster and external OpenShift clusters. Automation is provided to create a Kind cluster and deploy all required components on it. The KFP implementation is based on the KubeRay operator for creating and managing the Ray cluster, and the KubeRay API server for interacting with the KubeRay operator. An additional framework along with several KFP components is used to simplify the pipeline implementation.
A simple transform pipeline tutorial explains pipeline creation and execution. In addition, if you want to combine several transforms in a single pipeline, you can look at the multi-step pipeline example.
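For orientation only, the sketch below shows the generic shape of a KFP pipeline (two steps executed in order) using plain KFP v2 constructs; the actual DPK pipelines use the toolkit's own KFP components and the KubeRay operator described above, which are not shown here.

```python
# Generic KFP v2 sketch of ordering two pipeline steps; the DPK-specific
# components and Ray-cluster management are intentionally omitted.
from kfp import compiler, dsl


@dsl.component
def transform_step(name: str) -> str:
    # Placeholder for a real transform component.
    print(f"running {name}")
    return name


@dsl.pipeline(name="dpk-demo-pipeline")
def pipeline():
    ingest = transform_step(name="ingest")
    dedup = transform_step(name="dedup")
    dedup.after(ingest)  # enforce step ordering


if __name__ == "__main__":
    compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.yaml")
```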
When you finish working with the cluster and want to clean up or destroy it, see how to clean up the cluster.
You can run transforms via a Docker image or in a virtual environment. This document shows how to run a transform in a virtual environment; you can follow this document to run it using the Docker image.
If you use Data Prep Kit in your research, please cite our paper:
@misc{wood2024dataprepkitgettingdataready,
title={Data-Prep-Kit: getting your data ready for LLM application development},
author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh
and Constantin Adam and Abdulhamid Adebayo and Sungeun An and Yuan Chi Chang
and Xuan-Hong Dang and Nirmit Desai and Michele Dolfi and Hajar Emami-Gohari
and Revital Eres and Takuya Goto and Dhiraj Joshi and Yan Koyfman
and Mohammad Nassar and Hima Patel and Paramesvaran Selvam and Yousaf Shah
and Saptha Surendran and Daiki Tsuzuku and Petros Zerfos and Shahrokh Daijavad},
year={2024},
eprint={2409.18164},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2409.18164},
}