Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
If you just want to get started, we recommend you check the documentation. Curious, and want to know more? Keep reading!
Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects, including traditional predictive NLP (classification, extraction, etc.) and generative and large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.
Compute is expensive and output quality is important. We help you focus on data quality, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time achieving and keeping high-quality standards for your data.
Owning the data used to fine-tune your own LLMs is not easy, but distilabel can help you get started. We integrate AI feedback from any LLM provider through one unified API.
Synthesize and judge data with the latest research papers while ensuring flexibility, scalability and fault tolerance, so you can focus on improving your data and training your models.
We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:
Community Meetup: listen in or present during one of our bi-weekly events.
Discord: get direct support from the community in #argilla-general and #argilla-help.
Roadmap: plans change but we love to discuss those with our community so feel encouraged to participate.
The Argilla community uses distilabel to create amazing datasets and models.
pip install distilabel --upgrade
Requires Python 3.9+
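As a quick sanity check of the requirement above, the following minimal sketch verifies the interpreter version and reports which distilabel version, if any, is installed. The helper names `python_is_supported` and `installed_version` are hypothetical and are not part of distilabel; only the standard library is used.

```python
import sys
from importlib import metadata

# Minimum interpreter version stated in the installation notes.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def installed_version(package="distilabel"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("distilabel version:", installed_version())
```

If the second line prints `None`, the package is not installed in the active environment.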
In addition, the following extras are available:
anthropic: for using models available in the Anthropic API via the AnthropicLLM integration.
cohere: for using models available in Cohere via the CohereLLM integration.
argilla: for exporting the generated datasets to Argilla.
groq: for using models available in Groq using the groq Python client via the GroqLLM integration.
hf-inference-endpoints: for using the Hugging Face Inference Endpoints via the InferenceEndpointsLLM integration.
hf-transformers: for using models available in the transformers package via the TransformersLLM integration.
litellm: for using LiteLLM to call any LLM using the OpenAI format via the LiteLLM integration.
llama-cpp: for using the llama-cpp-python Python bindings for llama.cpp via the LlamaCppLLM integration.
mistralai: for using models available in the Mistral AI API via the MistralAILLM integration.
ollama: for using Ollama and its available models via the OllamaLLM integration.
openai: for using OpenAI API models via the OpenAILLM integration, or the rest of the integrations based on OpenAI and relying on its client, such as AnyscaleLLM, AzureOpenAILLM, and TogetherLLM.
vertexai: for using Google Vertex AI proprietary models via the VertexAILLM integration.
vllm: for using the vllm serving engine via the vLLM integration.
sentence-transformers: for generating sentence embeddings using sentence-transformers.
outlines: for using structured generation of LLMs with outlines.
instructor: for using structured generation of LLMs with Instructor.
ray: for scaling and distributing a pipeline with Ray.
faiss-cpu and faiss-gpu: for generating sentence embeddings using faiss.
text-clustering: for using text clustering with UMAP and Scikit-learn.
minhash: for using minhash for duplicate detection with datasketch and nltk.
To run the following example you must install distilabel with the hf-inference-endpoints extra:
pip install "distilabel[hf-inference-endpoints]" --upgrade
Then run:
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    load_dataset = LoadDataFromHub(output_mappings={"prompt": "instruction"})

    text_generation = TextGeneration(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
        ),
    )

    load_dataset >> text_generation

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            load_dataset.name: {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            text_generation.name: {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
    distiset.push_to_hub(repo_id="distilabel-example")
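In the example above, `output_mappings={"prompt": "instruction"}` renames the `prompt` column produced by the loading step to `instruction`, the column the text generation task reads. The sketch below illustrates that renaming idea in plain Python so it can be run without any API access. It is an assumption-level illustration, not distilabel's implementation: `apply_output_mappings` is a hypothetical helper and the sample row is made up.

```python
def apply_output_mappings(rows, mappings):
    """Return new rows with columns renamed according to `mappings`.

    Columns not listed in `mappings` keep their original names.
    The input rows are left unmodified.
    """
    return [
        {mappings.get(column, column): value for column, value in row.items()}
        for row in rows
    ]


# Illustrative data only; not taken from the dataset used in the example.
rows = [{"prompt": "Write a haiku about data.", "source": "demo"}]
renamed = apply_output_mappings(rows, {"prompt": "instruction"})
print(renamed)
```

Keeping the renaming at the boundary between steps means a task can be reused across datasets whose column names differ, without changing the task itself.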
If you build something cool with distilabel, consider adding one of these badges to your dataset or model card.
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
To contribute directly to distilabel, check our good first issues or open a new one.
@misc{distilabel-argilla-2024,
author = {Álvaro Bartolomé Del Canto and Gabriel Martín Blázquez and Agustín Piqueres Lajarín and Daniel Vila Suero},
title = {Distilabel: An AI Feedback (AIF) framework for building datasets with and for LLMs},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/argilla-io/distilabel}}
}