LawInstruct

This repository contains the code used to generate legal instruction datasets.

How to add a new dataset

  1. If there is no public Hugging Face repo for the dataset, take the raw data and upload it to the Hugging Face hub (in a private repo if the data is not permissively licensed)
  2. Add a class to the instruction_datasets folder that inherits from AbstractDataset and implements the abstract method get_data (see the sketch after this list). The get_data method should yield datapoints with the following fields:
    • "instruction_language": the language of the instruction
    • "prompt_language": the language of the prompt
    • "answer_language": the language of the answer
    • "instruction": the instruction telling the model what to do
    • "prompt": the prompt input to the model
    • "answer": the answer providing the solution
    • "task_type": the type of task (e.g. "summarization")
    • "jurisdiction": the jurisdiction of the example (e.g. "US")
    • "subset": the subset of the dataset (e.g. "swiss_judgment_prediction" for "lextreme")
  3. Add one to ten seed instructions for the new class to the en.json file
  4. Add the dataset to the list in lawinstruct_datasets.py
  5. To generate the dataset, run:
    python build_instruction_datasets.py --datasets ExampleDatasetName1 ExampleDatasetName2 --build_from_scratch
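
A minimal sketch of such a dataset class follows. The import path, the raw data, and the instruction text are assumptions for illustration; see the existing classes in instruction_datasets and the AbstractDataset base class for the actual interface.

    # instruction_datasets/example_dataset.py
    from typing import Iterator

    # Import path is an assumption; check the repo for the real module name.
    from abstract_dataset import AbstractDataset


    class ExampleDataset(AbstractDataset):
        def get_data(self) -> Iterator[dict]:
            # Hypothetical in-memory data; real classes read from the
            # lawinstruct_raw repository cloned during setup.
            raw_examples = [("Some judgment text ...", "Its summary ...")]
            for judgment, summary in raw_examples:
                yield {
                    "instruction_language": "en",
                    "prompt_language": "en",
                    "answer_language": "en",
                    "instruction": "Summarize the following judgment.",
                    "prompt": judgment,
                    "answer": summary,
                    "task_type": "summarization",
                    "jurisdiction": "US",
                    "subset": "example_dataset",
                }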

Setup

Install the requirements from requirements.txt. Make sure you have Python 3.10 or higher, and that git and git-lfs are installed.
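
For example, inside a fresh virtual environment:

    pip install -r requirements.txt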

On the UBELIX Slurm system, load the module with module load git-lfs/2.4.2. Then run git lfs install to set up git-lfs.

Clone the lawinstruct_raw repository locally:

git clone https://huggingface.co/datasets/lawinstruct/lawinstruct_raw

Clone the Natural Instructions data there too:

git clone https://github.com/allenai/natural-instructions lawinstruct_raw/raw_data/ni_instructions_data

The en.json file was created by writing one to five seed instructions per task. Using GPT-4, we then generated paraphrases for each task with the following prompt: "Below is a list of instructions for a large language model. Expand this json to 10 paraphrases. Provide json as output. Keep the provided examples."
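
A plausible sketch of the en.json structure, mapping each dataset class to its list of seed instructions (the dataset name and instruction texts here are illustrative; inspect the file in the repo for the exact schema):

    {
        "ExampleDataset": [
            "Summarize the following judgment.",
            "Provide a concise summary of the court decision below."
        ]
    }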

Possible improvements

Maybe later

Datasets possibly to be reconsidered later

Datasets to be added next

Datasets where we hit an obstacle

IR Datasets:

Summarization Datasets:

Other Datasets:

Troubleshooting

In the get_data() method, yield all examples belonging to one subset contiguously before moving on to the next subset. Otherwise, the writer will write only one example to the subset's file before closing it again.
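
A minimal sketch of the correct pattern (subset names and the _build_datapoints helper are hypothetical, standing in for whatever produces the datapoint dicts):

    def get_data(self):
        # Correct: finish one subset before starting the next. Interleaving
        # subsets makes the writer close each output file after one example.
        for datapoint in self._build_datapoints("subset_a"):
            yield datapoint
        for datapoint in self._build_datapoints("subset_b"):
            yield datapoint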

References

Please cite the following preprint:

@misc{niklaus2024flawnt5,
      title={FLawN-T5: An Empirical Examination of Effective Instruction-Tuning Data Mixtures for Legal Reasoning}, 
      author={Joel Niklaus and Lucia Zheng and Arya D. McCarthy and Christopher Hahn and Brian M. Rosen and Peter Henderson and Daniel E. Ho and Garrett Honke and Percy Liang and Christopher Manning},
      year={2024},
      eprint={2404.02127},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}