This repository contains the code used to generate legal instruction datasets.

To add a new dataset, create a class in `instruction_datasets` that inherits from `AbstractDataset` and implements the abstract method `get_data`. The `get_data` method should yield datapoints with the required fields. Then register the new dataset in `lawinstruct_datasets.py`.
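The pattern above can be sketched as follows. Note this is a minimal illustration, not the repository's actual code: `AbstractDataset` is reduced to a stand-in, and the datapoint field names are assumptions, not the real schema.

```python
from abc import ABC, abstractmethod

# Stand-in for the repository's AbstractDataset; the real class has more machinery.
class AbstractDataset(ABC):
    def __init__(self, name: str, source: str):
        self.name = name
        self.source = source

    @abstractmethod
    def get_data(self):
        """Yield one datapoint dict per example."""

# Hypothetical dataset class; name, source URL, and fields are illustrative.
class MyLegalDataset(AbstractDataset):
    def __init__(self):
        super().__init__("MyLegalDataset", "https://example.com/source")

    def get_data(self):
        # A real implementation would iterate over the raw data files here.
        yield {
            "instruction": "Summarize the following court decision.",
            "prompt": "...",
            "answer": "...",
        }
```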
Build the instruction datasets by running, for example:

```shell
python build_instruction_datasets.py --datasets ExampleDatasetName1 ExampleDatasetName2 --build_from_scratch
```
1. Install the requirements from `requirements.txt`. Make sure you have Python 3.10 or higher.
2. Make sure you have `git` and `git-lfs` installed. On the UBELIX Slurm system, load the module with `module load git-lfs/2.4.2`.
3. Run `git lfs install` to set up git-lfs.
4. Clone the `lawinstruct_raw` repository locally:
   ```shell
   git clone https://huggingface.co/datasets/lawinstruct/lawinstruct_raw
   ```
5. Clone the Natural Instructions data there too:
   ```shell
   git clone https://github.com/allenai/natural-instructions lawinstruct_raw/raw_data/ni_instructions_data
   ```
The `en.json` file was created by writing one to five seed instructions. Using GPT-4, we generated paraphrases for each task. We used the following prompt: "Below is a list of instructions for a large language model. Expand this json to 10 paraphrases. Provide json as output. Keep the provided examples."
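For illustration, the paraphrasing prompt quoted above could be assembled like this. The shape of `en.json` assumed here (task name mapped to a list of seed instructions) is a guess for the sketch, not the file's documented format:

```python
import json

# Assumed shape of en.json: task name -> list of seed instructions.
seeds = {"example_task": ["Answer the following legal question."]}

# The paraphrasing prompt quoted above, followed by the seed instructions as JSON.
prompt = (
    "Below is a list of instructions for a large language model. "
    "Expand this json to 10 paraphrases. Provide json as output. "
    "Keep the provided examples.\n\n"
    + json.dumps(seeds, indent=2)
)
print(prompt)
```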
Make sure to only yield from the same subset in the `get_data()` method. Otherwise, it will only write one example to the file and close it again.
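A sketch of this constraint, assuming the writer closes a subset's output file as soon as a datapoint from a different subset arrives: exhaust each subset completely before moving on to the next.

```python
# Correct: yield all datapoints of one subset together, so the writer
# opens and closes each subset's file exactly once.
def get_data(subsets):
    for subset_name, examples in subsets.items():
        for example in examples:
            yield {"subset": subset_name, "text": example}

# Interleaving subsets instead (e.g. alternating between them) would make
# the writer close each file after a single example.
```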
Please cite the following preprint:

```bibtex
@misc{niklaus2024flawnt5,
    title={FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning},
    author={Joel Niklaus and Lucia Zheng and Arya D. McCarthy and Christopher Hahn and Brian M. Rosen and Peter Henderson and Daniel E. Ho and Garrett Honke and Percy Liang and Christopher Manning},
    year={2024},
    eprint={2404.02127},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```