Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
Is your feature request related to a problem? Please describe.
When generating a dataset, it has become more and more useful to incorporate a pipeline of calls to larger LLMs in order to generate a teachable dataset for smaller models. For instance, a pipeline may involve retrieval using ColBERT to accurately fetch information from a large corpus of text in order to generate a domain-specific dataset for training a RAG model. A single call to a model is seldom enough to get a great synthetic dataset.
Describe the solution you'd like
Using DSPy would provide a simple, minimal framework for building such pipelines, with simple yet powerful constructs such as assertions for adding self-refinement to the pipelines.
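To make the request concrete, here is a rough sketch of the kind of pipeline I have in mind: retrieve, generate, assert, and retry with feedback when the assertion fails. It is plain Python and deliberately does not use the DSPy or distilabel APIs. Every name in it (`retrieve`, `generate_with_refinement`, `check_qa_format`, `fake_llm`) is hypothetical, the retriever is a toy word-overlap ranker standing in for ColBERT, and the "LLM" is a stub so the example runs without any model.

```python
from typing import Callable, List, Tuple

# Hypothetical toy corpus, only for illustration.
CORPUS = [
    "ColBERT is a late-interaction retrieval model.",
    "Paris is the capital of France.",
    "Retrieval models fetch passages from a corpus.",
]


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank passages by word overlap with the query.

    This is a stand-in for a real retriever such as ColBERT.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def check_qa_format(output: str) -> Tuple[bool, str]:
    """Assertion: the generation must look like a question/answer pair."""
    if output.startswith("Q:") and "\nA:" in output:
        return True, ""
    return False, "output must start with 'Q:' and contain an 'A:' line"


def generate_with_refinement(
    generate: Callable[[str], str],
    check: Callable[[str], Tuple[bool, str]],
    prompt: str,
    max_retries: int = 2,
) -> Tuple[str, int]:
    """Call the model, check the assertion, and retry with feedback on failure.

    Returns the accepted output and the number of attempts it took.
    """
    feedback = ""
    for attempt in range(1, max_retries + 2):
        output = generate(prompt + feedback)
        ok, message = check(output)
        if ok:
            return output, attempt
        # Feed the failed assertion back into the next prompt.
        feedback = f"\nPrevious attempt failed: {message}. Try again."
    raise ValueError(f"assertion still failing after {max_retries + 1} attempts")


def fake_llm(prompt: str) -> str:
    """Stub model: only follows the format once it has seen feedback."""
    if "Previous attempt failed" in prompt:
        return "Q: What does ColBERT do?\nA: It retrieves passages."
    return "ColBERT retrieves passages."


passages = retrieve("ColBERT retrieval model", CORPUS)
prompt = "Context:\n" + "\n".join(passages) + "\nWrite one QA pair."
example, attempts = generate_with_refinement(fake_llm, check_qa_format, prompt)
print(example)
print("attempts:", attempts)
```

In a real integration the stub would be replaced by an actual LLM call and the retry loop by whatever the chosen framework provides. The point is only that the check and the retry live inside the pipeline rather than in a post-processing step.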
Describe alternatives you've considered
An alternative would be a LangChain integration, which is also fine. However, LangChain tends to get pretty complex when setting up more novel pipelines or working with smaller models, for which LangChain's base prompts are not optimised.