DsDm: Dataset Selection with Datamodels

[blog post] [paper] [install] [datasets] [select data]
By: Logan Engstrom, Axel Feldmann, Aleksander Madry

DsDm is a model-aware dataset selection method that can greatly improve downstream model performance...

...see our paper for more details!

Installation

Install the required Python packages:

git clone git@github.com:madrylab/DsDm.git
cd DsDm
pip install -r requirements.txt

Datasets

We give instructions for (a) loading our candidate dataset and (b) selecting data with each studied selection method (DsDm and baselines).

Candidate dataset

The candidate dataset we select from is available on Hugging Face. It is a tokenized version of the C4 en.noblocklist split prepared by AllenAI (see Appendix A.1 of our work for more details); each example is 1024 tokens.

To load the dataset and display a slice:

from dsdm import selections, utils

# load dataset and tokenizer
# (WARNING: this will download a 400GB dataset)
ds = selections.get_candidate_dataset()
tokenizer = utils.tokenizer_maker()

# decode the first example back to text and display it
text = tokenizer.decode(ds[0])
print(text)

Loading selections

We provide selections for five methods (dsdm, classifier, dsir, random, and semdedup) and six target tasks (jeopardy, squad, lambada, cs_algorithms, lm_task_mix, gpt3_mix). Below, we describe how to load these selections.
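The method and target names listed above can be checked before any selection is requested. The following is a minimal sketch of such a check: `check_selection` and the three constants are hypothetical names introduced here for illustration and are not part of the dsdm package. It assumes that dsdm, classifier, and dsir need a target task, while semdedup and random do not, following the targeted/untargeted split used in this README.

```python
# Hypothetical helper, not part of the dsdm package.
# Method and target names are copied from the list in this README.
TARGETED_METHODS = {"dsdm", "classifier", "dsir"}
UNTARGETED_METHODS = {"semdedup", "random"}
TARGETS = {
    "jeopardy", "squad", "lambada",
    "cs_algorithms", "lm_task_mix", "gpt3_mix",
}


def check_selection(method, target=None):
    """Validate a (method, target) pair and return it in normalized form.

    Targeted methods must come with a listed target task.
    Untargeted methods ignore the target, so it is normalized to None.
    """
    if method in TARGETED_METHODS:
        if target not in TARGETS:
            raise ValueError(
                f"method {method!r} needs a target from {sorted(TARGETS)}, got {target!r}"
            )
        return method, target
    if method in UNTARGETED_METHODS:
        return method, None
    known = sorted(TARGETED_METHODS | UNTARGETED_METHODS)
    raise ValueError(f"unknown method {method!r}; expected one of {known}")
```

For example, `check_selection("dsdm", "squad")` returns the pair unchanged, while `check_selection("dsdm")` raises a `ValueError` because no target was given.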

Download dependencies

Loading selections requires some setup. First, install Git LFS. Then, from the repository root, pull all the required metadata files:

git lfs fetch --all
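One way to confirm that the metadata files are actually present is to look for leftover Git LFS pointer stubs, which are small text files beginning with the line `version https://git-lfs.github.com/spec/v1`. The sketch below scans a directory for such stubs; `is_lfs_pointer` and `find_lfs_pointers` are hypothetical helpers, not part of the dsdm package, and which directory holds the metadata files is not specified here, so the root path is left to the caller.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` is still a Git LFS pointer stub."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(root):
    """List files under `root` that are still pointer stubs, skipping .git."""
    root = Path(root)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and ".git" not in p.relative_to(root).parts
        and is_lfs_pointer(p)
    )
```

Calling `find_lfs_pointers(".")` from the repository root should return an empty list once the real files are in place. If stubs remain after fetching, running `git lfs checkout` (or `git lfs pull`, which fetches and checks out in one step) replaces them with the downloaded content.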

Selecting data

Then load the selections:

from dsdm import selections, utils

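# targeted methods: dsdm, classifier, dsir
method = "dsdm"
target = "squad"
num_examples = 100_000
# assumes get_indices is exposed by the selections module
indices = selections.get_indices(method, num_examples, target)

# untargeted methods: semdedup, random
# (target is passed as above; untargeted methods do not depend on a target task)
method = "semdedup"
num_examples = 100_000
indices = selections.get_indices(method, num_examples, target)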

# select a subset
ds = selections.get_candidate_dataset()
selected_ds = ds.select(indices)
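Since each candidate example is 1024 tokens, the token count of a selection follows directly from the number of selected examples. A small sketch of that arithmetic is shown here; `tokens_in_selection` is a hypothetical helper, and it assumes every selected example has exactly the 1024-token length stated in the candidate dataset description.

```python
# Example length stated in the candidate dataset description.
TOKENS_PER_EXAMPLE = 1024


def tokens_in_selection(num_examples, tokens_per_example=TOKENS_PER_EXAMPLE):
    """Total number of tokens in a selection of fixed-length examples."""
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    return num_examples * tokens_per_example


# A 100,000-example selection, as in the loading example.
print(tokens_in_selection(100_000))  # 102400000
```

This can help when choosing `num_examples` to match a fixed training token budget.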

Selecting data

🚧 Coming soon! 🏗️