This project generates a high-quality Alpaca-style dataset from input text files, PDFs, and Word documents. It features optimized performance, GPU acceleration, and customizable output.


Project Structure

├── src/
│   ├── main.py
│   ├── config.py
│   ├── data_loader.py
│   ├── model_setup.py
│   ├── dataset_generator.py
│   ├── validation.py
│   └── utils.py
├── data/
│   └── input/
│       ├── file1.txt
│       ├── file2.pdf
│       └── file3.docx
├── output/
│   ├── raw_dataset.jsonl
│   └── validated_dataset.jsonl
├── requirements.txt
└── README.md
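
The layout above places all source documents under data/input/. As a rough illustration of how a run could collect them, here is a minimal sketch that lists the supported file types in that directory. The helper name `find_input_files` and the constant `SUPPORTED_EXTENSIONS` are hypothetical and are not taken from src/data_loader.py, whose actual logic may differ.

```python
from pathlib import Path

# Extensions listed in this README as supported input types.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx"}


def find_input_files(input_dir="data/input"):
    """Return supported files directly under input_dir, sorted by path.

    Hypothetical helper for illustration; not part of the project's API.
    """
    root = Path(input_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    # Print each file that would be picked up from data/input/.
    for path in find_input_files():
        print(path)
```

Running a check like this before the main script can help catch an empty or misnamed input directory early.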


Installation

  1. Clone the repository:

     git clone https://github.com/ekatraone/alpaca-dataset-generator.git
     cd alpaca-dataset-generator

  2. Create a virtual environment and activate it:

     python -m venv venv
     source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

  3. Install the required packages:

     pip install -r requirements.txt

  4. Download NLTK data:

     python -m nltk.downloader punkt stopwords


Configuration

Open src/config.py and adjust the settings as needed.


Usage

  1. Place your input files (.txt, .pdf, .docx) in the data/input/ directory.

  2. Run the script:

     python src/main.py --num_examples 1000

  3. The script generates two files in the output/ directory:
    • raw_dataset.jsonl: contains all generated examples
    • validated_dataset.jsonl: contains only the examples that passed validation
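
To inspect the two output files after a run, something like the following sketch may be useful. It reads each JSONL file (one JSON object per line) and reports what fraction of the raw examples passed validation. The function names `load_jsonl` and `pass_rate` are hypothetical. The field names `instruction`, `input`, and `output` follow the common Alpaca convention and are an assumption, since this README does not spell out the record schema.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def pass_rate(raw_path, validated_path):
    """Return the fraction of raw examples that appear in the validated file."""
    raw = load_jsonl(raw_path)
    validated = load_jsonl(validated_path)
    return len(validated) / len(raw) if raw else 0.0


if __name__ == "__main__":
    raw_file = Path("output/raw_dataset.jsonl")
    validated_file = Path("output/validated_dataset.jsonl")
    if raw_file.exists() and validated_file.exists():
        print(f"Validation pass rate: {pass_rate(raw_file, validated_file):.1%}")
        examples = load_jsonl(validated_file)
        if examples:
            # Assumed Alpaca-style keys; .get() avoids a KeyError if the schema differs.
            first = examples[0]
            print("instruction:", first.get("instruction"))
            print("input:", first.get("input"))
            print("output:", first.get("output"))
```

A low pass rate may suggest that the validation settings are too strict for the input material, or that the source documents need cleaning.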




For information about the latest releases and changes, please refer to the CHANGELOG.md file.


