OFA-Sys / InsTag

InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning

We introduce InsTag, a tool for analyzing the supervised fine-tuning (SFT) data used to align LLMs with human preferences. For local tagging deployment, we release InsTagger, a model fine-tuned on InsTag results, to tag the queries in SFT data. Through the lens of tags, we sample a 6K subset of open-source SFT data to fine-tune LLaMA and LLaMA-2. The resulting models, TagLM-13B-v1.0 and TagLM-13B-v2.0, outperform many open-source LLMs on MT-Bench.

🤗 InsTagger Checkpoint • 👉 Online LocalTagger Demo • 📖 Paper

🤖️ TagLM-13B-v1.0 Checkpoint 🤖️ TagLM-13B-v2.0 Checkpoint

What is InsTag?

Foundation language models acquire instruction-following ability through supervised fine-tuning (SFT). Diversity and complexity are considered critical factors of a successful SFT dataset, yet their definitions remain vague and lack quantitative analysis. In this work, we propose InsTag, an open-set fine-grained tagger that tags samples in SFT datasets by their semantics and intentions, and we define instruction diversity and complexity in terms of these tags. We obtain 6.6K tags that describe a comprehensive range of user queries. We analyze popular open-source SFT datasets and find that model ability grows with more diverse and more complex data. Based on this observation, we propose an InsTag-based data selector that picks 6K diverse and complex samples from open-source datasets, and we fine-tune models on the selected data. Evaluated on MT-Bench, these models outperform open-source models trained on considerably larger SFT datasets, which underlines the importance of query diversity and complexity.
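To make the tag-based view concrete, here is a minimal sketch of how diversity, complexity, and a tag-driven selector could be computed once every query has a set of tags. It rests on assumptions that this README does not spell out: diversity is taken as the number of distinct tags in a dataset, complexity as the mean number of tags per query, and the selector as a greedy pass that prefers queries with many tags and only keeps a query if it contributes a tag not yet covered in the current pass. The function names and the toy data are hypothetical and are not part of the released code.

```python
from typing import Dict, List, Set


def tag_diversity(tagged: Dict[str, Set[str]]) -> int:
    """Assumed definition: number of distinct tags across all queries."""
    return len(set().union(*tagged.values()))


def tag_complexity(tagged: Dict[str, Set[str]]) -> float:
    """Assumed definition: mean number of tags per query."""
    if not tagged:
        return 0.0
    return sum(len(tags) for tags in tagged.values()) / len(tagged)


def select_samples(tagged: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedy sketch: visit queries from most to fewest tags and keep a
    query only if it adds at least one tag not yet covered in this pass.
    Passes repeat over the leftovers until the budget is filled."""
    remaining = sorted(tagged, key=lambda q: len(tagged[q]), reverse=True)
    selected: List[str] = []
    while remaining and len(selected) < budget:
        covered: Set[str] = set()
        leftover: List[str] = []
        for query in remaining:
            adds_new_tag = not tagged[query] <= covered
            if len(selected) < budget and adds_new_tag:
                selected.append(query)
                covered |= tagged[query]
            else:
                leftover.append(query)
        if len(leftover) == len(remaining):
            break  # no progress, e.g. only untagged queries remain
        remaining = leftover
    return selected


# Toy data: query id -> tags (hypothetical, for illustration only).
toy = {
    "q1": {"math", "code"},
    "q2": {"math"},
    "q3": {"translation", "poetry", "code"},
    "q4": {"code"},
}
print(tag_diversity(toy), tag_complexity(toy), select_samples(toy, 2))
```

With a budget of 2, the sketch keeps the two queries that together cover all four tags and skips the single-tag queries whose tags are already covered.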

News

InsTagger

InsTagger is a LLaMA-2-based SFT model trained with FastChat using the Vicuna template. You can download the weights from the HuggingFace Model Hub and then use FastChat to serve the model or run inference. Demo code will be released soon.
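Until the demo code is available, the following sketch shows how a tagging prompt might be laid out in the Vicuna style. It assumes the Vicuna v1.1 conversation layout (system text, then `USER:` and `ASSISTANT:` turns separated by spaces). The instruction wording in `TAGGING_INSTRUCTION` and the helper `build_tagging_prompt` are hypothetical placeholders, not the prompt InsTagger was trained on, so check the model card before relying on them.

```python
# Assumption: Vicuna v1.1 system text and turn layout.
VICUNA_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

# Hypothetical instruction text; replace it with the wording from the model card.
TAGGING_INSTRUCTION = "Please identify tags of user intentions in the following user query."


def build_tagging_prompt(query: str, instruction: str = TAGGING_INSTRUCTION) -> str:
    """Wrap one user query in a single-turn Vicuna-style prompt.

    The prompt ends with "ASSISTANT:" so the model continues with its answer.
    """
    user_turn = f"{instruction}\nQuery: {query.strip()}"
    return f"{VICUNA_SYSTEM} USER: {user_turn} ASSISTANT:"


prompt = build_tagging_prompt("Write a haiku about autumn in French.")
print(prompt)
```

The resulting string can be passed to whichever inference path you use. When serving through FastChat with its Vicuna conversation template, the template is applied for you and only the user turn needs to be supplied.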

Model Checkpoints

Citation

Please cite our work if you find the repository helpful.

@misc{lu2023instag,
      title={#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models}, 
      author={Keming Lu and Hongyi Yuan and Zheng Yuan and Runji Lin and Junyang Lin and Chuanqi Tan and Chang Zhou and Jingren Zhou},
      year={2023},
      eprint={2308.07074},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}