argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[FEATURE] Integrate dspy #244

Closed: sutyum closed this issue 10 months ago

sutyum commented 10 months ago

Is your feature request related to a problem? Please describe. When generating a dataset, it has become increasingly useful to chain calls to larger LLMs in order to produce a teachable dataset for smaller models. For instance, a pipeline may use ColBERT retrieval to accurately fetch information from a large text corpus and then generate a domain-specific dataset for training a RAG model. A single call to a model is seldom going to be enough to get a great synthetic dataset.

Describe the solution you'd like Using dspy would provide a simple, minimal framework for building pipelines, with simple yet powerful constructs such as assert for adding self-refinement to them.
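
To make the idea concrete, here is a rough sketch in plain Python of what assert-driven self-refinement could look like: generate, check a constraint, and retry with the failure message fed back in. This is not dspy's API and not distilabel's either; `refine_with_assert`, `toy_generate` and `AssertionFailed` are hypothetical names, and the toy generator is only a stand-in for a real LLM call.

```python
from typing import Callable, List


class AssertionFailed(Exception):
    """Raised when the output still violates the constraint after all retries."""


def refine_with_assert(
    generate: Callable[[str, List[str]], str],
    check: Callable[[str], bool],
    prompt: str,
    feedback_message: str,
    max_retries: int = 2,
) -> str:
    """Call `generate` until `check` passes, feeding failures back as feedback."""
    feedback: List[str] = []
    for _ in range(max_retries + 1):
        output = generate(prompt, feedback)
        if check(output):
            return output
        # Constraint failed: record the message so the next attempt can use it.
        feedback.append(feedback_message)
    raise AssertionFailed(
        f"constraint not met after {max_retries + 1} attempts: {feedback_message}"
    )


def toy_generate(prompt: str, feedback: List[str]) -> str:
    """Hypothetical stand-in for an LLM call; adds a citation once asked to."""
    answer = f"Draft answer to: {prompt}"
    if feedback:
        answer += " [source: doc-1]"
    return answer


result = refine_with_assert(
    generate=toy_generate,
    check=lambda text: "[source:" in text,
    prompt="What does the corpus say about X?",
    feedback_message="Cite the retrieved passage.",
)
print(result)
```

In this sketch the first attempt fails the citation check and the second one passes, so only outputs that satisfy the constraint would end up in the generated dataset.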

Describe alternatives you've considered An alternative would be a langchain integration, which is also fine. However, langchain tends to get complex when setting up more novel pipelines or working with smaller models, for which the base prompts within langchain are not optimised.

davidberenstein1957 commented 10 months ago

Hi @sutyum, thank you for the suggestion. We have created this discussion about chaining and LLMs, so any input would be very welcome there :)

davidberenstein1957 commented 10 months ago

Will be tackled in the discussion mentioned above.