Prompt. Generate Synthetic Data. Train & Align Models.
DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows. It is designed to be simple, extremely efficient, and research-grade.
Installation: pip3 install datadreamer.dev
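After installing, it can be handy to confirm that the package is visible to the interpreter you plan to use. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and is not part of DataDreamer. It assumes the distribution is registered under the same name used in the install command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of DataDreamer.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution name matches the pip install command.
    found = installed_version("datadreamer.dev")
    if found is None:
        print("datadreamer.dev is not installed in this environment")
    else:
        print(f"datadreamer.dev {found} is installed")
```

Running the check with the same interpreter that runs your scripts helps catch cases where the package was installed into a different environment.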
| demo.py | Result of demo.py |
|---|---|
| See the full demo script | See the synthetic dataset and the trained model |

🚀 For more demonstrations and recipes, see the Quick Tour page.
With DataDreamer you can:
DataDreamer is:
Please cite the DataDreamer paper:
@misc{patel2024datadreamer,
title={DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows},
author={Ajay Patel and Colin Raffel and Chris Callison-Burch},
year={2024},
eprint={2402.10379},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Please reach out to us via email (ajayp@upenn.edu) or on Discord if you have any questions, comments, or feedback.
Copyright © 2024, Ajay Patel. Released under the MIT License.
Thank you to the maintainers at Hugging Face and LiteLLM for accepting contributions necessary for DataDreamer and providing upstream support.
ODNI, IARPA: This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.