CASIA-LM / MoDS

MoDS: Model-oriented Data Selection for Instruction Tuning

This repo contains the codes of MoDS to select valuable instruction data from large-scale datasets for a given LLM.

Introduction

Instruction tuning has become the de facto method to equip large language models (LLMs) with the ability to follow user instructions. Usually, hundreds of thousands or even millions of instruction-response pairs are employed to fine-tune foundation LLMs. Recently, several studies have shown that a small amount of high-quality instruction data can be enough. However, how to select appropriate instruction data for a given LLM is still an open problem. To address this problem, we present a model-oriented data selection (MoDS) approach, which selects instruction data based on a new criterion covering three aspects: quality, coverage and necessity. First, our approach uses a quality evaluation model to filter the high-quality subset out of the original instruction dataset, and then applies an algorithm to further select from this subset a seed instruction dataset with good coverage. The seed dataset is used to fine-tune the foundation LLM and obtain an initial instruction-following LLM. Finally, we develop a necessity evaluation model to find the instructions on which the initial instruction-following LLM performs badly, and treat them as the instructions necessary to further improve the LLM. In this way, we obtain a small, high-quality, broad-coverage and high-necessity subset of the original instruction dataset. Figure 1 presents the architecture of our approach.

Environment Dependencies

transformers==4.31.0
json==2.0.9
pytorch==2.0.1
numpy==1.25.2
sklearn==1.2.1
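
These dependencies can be installed with pip. Note that json ships with the Python standard library and does not need to be installed separately, and that sklearn is provided by the scikit-learn package:

pip install transformers==4.31.0 torch==2.0.1 numpy==1.25.2 scikit-learn==1.2.1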

Stage 1: Quality Evaluation

The quality of instruction data plays a crucial role in learning instruction-following capabilities for LLMs. Therefore, to select effective instruction data, we first evaluate the quality of each instruction and its corresponding response in the large-scale dataset, and then keep only the higher-quality data.
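
As a rough illustration, such a quality filter can be implemented with the listed dependencies as follows. The reward model (e.g., OpenAssistant/reward-model-deberta-v3-large-v2), the file names and the threshold below are illustrative assumptions, not necessarily the exact configuration used in this repo:

# Illustrative sketch: score each (instruction, response) pair with a reward
# model and keep the pairs whose score exceeds a threshold.
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed scorer
THRESHOLD = 0.0                                                # assumed cut-off
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()
with open("instruction-data.json") as f:     # assumed input file
    samples = json.load(f)                   # [{"instruction": ..., "output": ...}, ...]
high_quality = []
for sample in samples:
    inputs = tokenizer(sample["instruction"], sample["output"],
                       return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        score = model(**inputs).logits[0].item()   # scalar quality score
    if score > THRESHOLD:
        high_quality.append(sample)
with open("high-quality-data.json", "w") as f:
    json.dump(high_quality, f, ensure_ascii=False, indent=2)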

Stage 2: Diverse Data Selection for Seed Instructions

After obtaining a high-quality instruction dataset, we further select data from it. In order to select diverse instruction data with maximum coverage, we use the K-Center greedy algorithm for data selection:

cd diverse-data-selection
python run.py ../quality-evaluation/high-quality-data.json ./seed-instructions.json top_k
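
The core of this step is the K-Center greedy selection itself, sketched below under the assumption that the instructions have already been embedded into vectors (for example with a BERT encoder); the function and variable names are illustrative and not necessarily those used in run.py:

import numpy as np
def k_center_greedy(embeddings, top_k):
    # Greedily pick points that maximize the minimum distance to the
    # already-selected centers, giving broad coverage of the embedding space.
    n = embeddings.shape[0]
    selected = [np.random.randint(n)]                  # arbitrary first center
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < top_k:
        next_idx = int(np.argmax(min_dist))            # farthest remaining point
        selected.append(next_idx)
        new_dist = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)      # update nearest-center distances
    return selected
# Example usage: seed_indices = k_center_greedy(instruction_embeddings, top_k=1000)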

Stage 3: Augmented Data Selection

For different LLMs, the knowledge and capabilities learned during pre-training are different, so the instruction-tuning data they require differs as well. For a given instruction, if the LLM can already generate a good response, it indicates that the LLM is able to handle this type of instruction, and this instruction is not necessary for its fine-tuning. Conversely, if the LLM cannot generate a good response, it suggests that the LLM cannot effectively process that type of instruction, and the instruction is important and unique for fine-tuning the target LLM. In this stage, we extract the instructions with bad responses to build an augmented dataset for the given LLM, as sketched below.
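
A minimal sketch of this necessity filtering, assuming the initial instruction-following LLM from Stage 2 and the same kind of evaluation model as in Stage 1; the paths, model names and threshold are illustrative assumptions:

# Illustrative sketch: find instructions that the initial instruction-following
# LLM still handles poorly, and keep them as the augmented dataset.
import json
import torch
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          AutoModelForSequenceClassification)
LLM_PATH = "path/to/initial-instruction-following-llm"           # assumed checkpoint
REWARD_MODEL = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed scorer
THRESHOLD = 0.0                                                  # assumed necessity cut-off
llm_tok = AutoTokenizer.from_pretrained(LLM_PATH)
llm = AutoModelForCausalLM.from_pretrained(LLM_PATH).eval()
rm_tok = AutoTokenizer.from_pretrained(REWARD_MODEL)
rm = AutoModelForSequenceClassification.from_pretrained(REWARD_MODEL).eval()
with open("high-quality-data.json") as f:
    candidates = json.load(f)
augmented = []
for sample in candidates:
    prompt = sample["instruction"]
    inputs = llm_tok(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = llm.generate(**inputs, max_new_tokens=512)
    response = llm_tok.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
    rm_inputs = rm_tok(prompt, response, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        score = rm(**rm_inputs).logits[0].item()
    if score < THRESHOLD:   # the initial LLM responds badly -> necessary instruction
        augmented.append(sample)
with open("augmented-data.json", "w") as f:
    json.dump(augmented, f, ensure_ascii=False, indent=2)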

Stage 4: Fine-tuning with Selected Instructions
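
In this stage, the foundation LLM is fine-tuned on the union of the seed instructions and the augmented dataset selected above. The command below is only an illustrative example assuming an Alpaca-style training script and LLaMA2 as the foundation model; the actual script and hyper-parameters used for MoDS may differ:

torchrun --nproc_per_node=8 train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path ./selected-instructions.json \
    --output_dir ./mods-llama2-7b \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --learning_rate 2e-5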

Citation

Please cite the paper if you use the data or code in this repo.

@misc{du2023mods,
      title={MoDS: Model-oriented Data Selection for Instruction Tuning}, 
      author={Qianlong Du and Chengqing Zong and Jiajun Zhang},
      year={2023},
      eprint={2311.15653},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Naturally, you should also cite the work of LLaMA2 and Alpaca.