for-ai / iterative-data-selection

19 stars 5 forks source link

Selecting Diverse Instructions

This repository contains the official code for the paper: Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement.

KMQ Visualization

Dataset

To download the datasets used in this project, run this script. We used Alpaca, ShareGPT and WizardLM datasets for training and evaluation.

After downloading, datasets will be stored in the data/processed directory of the project.

Coreset Selection

The hyperparameters and configurations are managed by Hydra. The configurations are stored in selection/config/. You should run the code by executing main.py in the selection directory. You can also specify the hyperparameters by command line arguments.

cd selection
python main.py data=[sharegpt|wizardlm] encoder=miniLM coreset=random

The selected indices are stored under selection/indices/.

Finetuning

# Llama-2-7b-hf (with accelerate and deepspeed)
bash scripts/finetune_llama_with_accelerate.sh [INDICES]

Iterative selection is implemented in the scripts/iter/ directory.

Evaluation

bash scripts/eval/{eval}.sh

Reference

This code is based on the following repository:

Citation

If you find this code useful, please cite our paper:

@misc{yu2024diversify,
    title={Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement},
    author={Simon Yu and Liangyu Chen and Sara Ahmadian and Marzieh Fadaee},
    year={2024},
    eprint={2409.11378},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}