Alpaca-style Dataset Generator

This project generates a high-quality Alpaca-style dataset from input text files, PDFs, and Word documents. It features optimized performance, GPU acceleration, and customizable output.

Features

Project Structure

alpaca-dataset-generator/
│
├── src/
│   ├── main.py
│   ├── config.py
│   ├── data_loader.py
│   ├── model_setup.py
│   ├── dataset_generator.py
│   ├── validation.py
│   └── utils.py
│
├── data/
│   └── input/
│       ├── file1.txt
│       ├── file2.pdf
│       └── file3.docx
│
├── output/
│   ├── raw_dataset.jsonl
│   └── validated_dataset.jsonl
│
├── requirements.txt
└── README.md
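
The layout above puts source documents under data/input/, and the project description names three supported formats: text files, PDFs, and Word documents. As a minimal sketch of how such a directory could be scanned, the snippet below lists the files with a supported extension. The helper name `find_input_files` and the constant `SUPPORTED_EXTENSIONS` are hypothetical and are not taken from src/data_loader.py.

```python
from pathlib import Path

# Extensions named in this README; the constant itself is illustrative.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx"}


def find_input_files(input_dir):
    """Return the supported files directly under input_dir, sorted by path.

    Hypothetical helper: the project's real loader may work differently.
    """
    root = Path(input_dir)
    if not root.is_dir():
        # A missing directory simply yields no inputs.
        return []
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_input_files("data/input")` from the repository root would return `Path` objects for file1.txt, file2.pdf, and file3.docx in the example tree. The extension check is case-insensitive, so a file named REPORT.PDF is also picked up.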

Setup

  1. Clone the repository:
git clone https://github.com/ekatraone/alpaca-dataset-generator.git
cd alpaca-dataset-generator
  2. Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install the required packages:
pip install -r requirements.txt
  4. Download NLTK data:
python -m nltk.downloader punkt stopwords

Configuration

Open src/config.py and adjust the settings as needed.

Usage

  1. Place your input files (.txt, .pdf, .docx) in the data/input/ directory.

  2. Run the script:

python src/main.py --num_examples 1000
  3. The script generates two files in the output/ directory:
    • raw_dataset.jsonl: Contains all generated examples
    • validated_dataset.jsonl: Contains only the examples that passed validation
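
After a run, it can be useful to check how many examples survived validation. The sketch below reads both JSONL files and counts records. It assumes each line is one JSON object with the `instruction`, `input`, and `output` keys used by the original Stanford Alpaca format. This README does not confirm the exact keys, and the function names here are hypothetical.

```python
import json

# Keys from the original Alpaca format; assumed, not confirmed by this project.
ALPACA_FIELDS = ("instruction", "input", "output")


def load_jsonl(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def missing_fields(record):
    """Return the assumed Alpaca keys that are absent from a record."""
    return [name for name in ALPACA_FIELDS if name not in record]


def summarize(raw_path, validated_path):
    """Count raw and validated examples, plus validated ones missing a key."""
    raw = load_jsonl(raw_path)
    validated = load_jsonl(validated_path)
    incomplete = sum(1 for record in validated if missing_fields(record))
    return {"raw": len(raw), "validated": len(validated), "incomplete": incomplete}
```

Running `summarize("output/raw_dataset.jsonl", "output/validated_dataset.jsonl")` from the repository root returns a small dictionary of counts. A nonzero `incomplete` value would suggest the output uses different keys than assumed here.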

Customization

Troubleshooting

Releases

For information about the latest releases and changes, please refer to the CHANGELOG.md file.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.