This repository is used for NLP at Colgate University, taught by Profs. Forrest Davis and Grusha Prasad. It is mainly a toolkit for running various NLP experiments, with the aim of training NLP Scholars! It builds on top of the wonderful HuggingFace NLP tools.

Never heard of NLP Scholars before? No worries, we wrote a paper: [Training an NLP Scholar at a Small Liberal Arts College: A Backwards Designed Course Proposal](https://aclanthology.org/2024.teachingnlp-1.16), which we presented at the Sixth Workshop on Teaching NLP, co-located with ACL 2024.

PLEASE NOTE: This is under active development. If you run into issues, we are sorry! First, thank you so much for using this; we would love to hear that you are, and how you incorporate it into your class/research. You can cite this in your work if you want (see the bottom) :). Second, please submit an issue on GitHub, and we will try to fix it promptly (if it's not a crazy time of the semester). If you are in our class, please email us.

Now onto the details!
Install the `nlp` environment (see Install.md). Ensure you have run

```bash
conda activate nlp
```

You can run experiments via `main.py` with the relevant config file. For example, you can try a sample config for `interact`:

```bash
python main.py sample_configs/interact.yaml
```

If no config is provided as an argument, `config.yaml` is used. See below for details on the structure of the config files.
Interacting with the toolkit is facilitated by config files. An example is copied below, showing how to run GPT2 in interactive mode.

```yaml
exp: MinimalPair
mode:
    - interact
models:
    hf_causal_model:
        - gpt2
```
## exp

There are three experiments: `MinimalPair`, `TextClassification`, and `TokenClassification`. `MinimalPair` is for by-token (that is, including subword) predictability measures for minimal pair experiments (i.e., targeted syntactic evaluations). `TextClassification` is for classification over texts; in other words, one label is returned for each inputted text (e.g., sentiment analysis, natural language inference). `TokenClassification` is for classification over tokens in a text; in other words, one label is returned for each token (i.e., subword) in a text (e.g., part-of-speech tagging, named entity recognition).
## mode

Each experiment has four `mode`s: `interact`, `train`, `evaluate`, and `analyze`.
| Experiment | Interact | Train | Evaluate | Analyze |
|---|---|---|---|---|
| MinimalPair | Returns by-word predictability for an inputted sentence (combining subword tokens) | Finetunes LM using the relevant objective (e.g., autoregressive, MLM) | Returns by-token predictability measures for each input in a dataset | Returns the average difference in predictability between expected and unexpected for each condition and model |
| TextClassification | Returns by-text classification labels for some inputted text | Finetunes a pretrained model for text classification | Returns by-text labels for each input in a dataset | Returns average accuracy, precision, and recall for each class in each condition |
| TokenClassification | Returns by-token classification labels for an inputted text | Finetunes a pretrained model for token classification | Returns by-token labels for each input in a dataset | Returns average accuracy, precision, and recall for each class in each condition |
## models

There are four types of models supported (all building on HuggingFace's transformers library): causal language models (`hf_causal_model`), masked language models (`hf_masked_model`), text classification models (`hf_text_classification_model`), and token classification models (`hf_token_classification_model`). You can use any model that is listed on HuggingFace's models hub, as long as it works with the relevant auto class (`AutoModelForCausalLM`, `AutoModelForMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForTokenClassification`). To refer to a model, you give its type and then its model name on HuggingFace (or the path to the folder on your local computer). More than one model from each class can be specified. For example,
```yaml
models:
    hf_causal_model:
        - gpt2
    hf_masked_model:
        - bert-base-cased
        - roberta-base
```

will load one causal language model (GPT2) and two masked language models (BERT and RoBERTa).
## interact

The `interact` mode builds on the `evaluate` mode and allows you to interact with a model before running it. You simply specify a model (and any additional config settings):

```yaml
exp: MinimalPair
mode:
    - interact
models:
    hf_masked_model:
        - bert-base-cased
```
## evaluate

### datafpath and predfpath

When running in `evaluate` mode, you need to specify the path to the file with the data you will be running the model on (`datafpath`) and the path where you want to save the predictions (`predfpath`). For example,

```yaml
exp: MinimalPair
mode:
    - evaluate
models:
    hf_causal_model:
        - gpt2
    hf_masked_model:
        - bert-base-cased
datafpath: data/minimal_pairs.tsv
predfpath: predictions/minimal_pairs.tsv
```

runs GPT2 and BERT on the minimal pairs in `data/minimal_pairs.tsv` and saves the output to `predictions/minimal_pairs.tsv`.
### checkFileColumns

When evaluating models (in `evaluate` mode), you can check that the columns necessary for the broader experiment are included in the file indexed with `datafpath`. The default value is `True`, meaning the columns are checked.

```yaml
checkFileColumns: True
```
### loadAll

When evaluating models (in `evaluate` mode), more than one model can be evaluated. `loadAll` controls memory usage, with `True` loading all models into memory at once and `False` loading one model at a time. The default behavior is `False`.

```yaml
loadAll: False
```
### stride

When using models with a fixed context length (like `gpt2`), care needs to be taken when calculating the predictability measures of a token in long contexts that exceed the maximum length of the model. In these cases, we use a striding window strategy. You can use `stride` to control the size of the stride. The default behavior is half of the maximum length allowed by the model.

```yaml
stride: 100
```
To make this clear, here is an example. Suppose the maximum context length is 5, each word is represented as one token, and our stride is 3. The following sentence would cause problems:

```
the strange boy is outside and the girl saw him
```

It has 10 words, but the model only allows for 5. Using a stride yields the following fragments, over which predictability measures are calculated:

```
the strange boy is outside
is outside and the girl
the girl saw him
```

The first occurrence of a word is where its probability is calculated. So, for example, `outside` is P(outside | the strange boy is) and `girl` is P(girl | is outside and the).
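To see the windowing arithmetic concretely, here is a minimal Python sketch of the striding logic described above (an illustration, not the toolkit's actual implementation):

```python
# Striding-window sketch: max context of 5 tokens, stride of 3.
tokens = "the strange boy is outside and the girl saw him".split()
max_len, stride = 5, 3

start = 0
while start < len(tokens):
    # Each window covers up to max_len tokens starting at `start`.
    print(" ".join(tokens[start:start + max_len]))
    if start + max_len >= len(tokens):
        break  # this window already reaches the end of the text
    start += stride
# Prints:
# the strange boy is outside
# is outside and the girl
# the girl saw him
```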
Note that for certain models, special tokens are appended to the start and end of a text (like `[CLS]`). At the moment, we handle this just for masked language models. Importantly, these additional tokens interact with stride as we have implemented it: they are ignored in calculating the stride jumps and the maximum context length. So, in effect, the maximum length is less than the total allowed (by two for now, though future versions should handle this more rigorously).
## analyze

The `analyze` mode is designed to take the predictions from the `evaluate` mode and generate summaries that are relevant to each experiment type. More details can be found in the following markdown files in `src/analysis/`: `MinimalPairAnalysis.md`, `TextClassificationAnalysis.md`, and `TokenClassificationAnalysis.md`.

In the `analyze` mode, you need to specify three filepaths:

- `predfpath`: the filepath to the predictions generated by the `evaluate` mode for any experiment.
- `datafpath`: the filepath that maps each input in the evaluation dataset to the specific condition it belongs to.
- `resultsfpath`: the filepath where the summary output should be saved.

Each experiment type has several other optional parameters, the details of which can be found in the experiment-specific markdown files in `src/analysis/`.
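For example, a minimal `analyze` config for the minimal pair experiment above might look like the following sketch (the `resultsfpath` value is illustrative):

```yaml
exp: MinimalPair
mode:
    - analyze
predfpath: predictions/minimal_pairs.tsv
datafpath: data/minimal_pairs.tsv
resultsfpath: results/minimal_pairs.tsv
```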
## train

### trainfpath, validfpath, and modelfpath

In `train` mode you need to specify training data, validation data, and an output directory for the final model. Training and validation data can either be from HuggingFace's dataset options on their hub or local json or tsv files. To specify a remote HuggingFace dataset, give the name, task (if applicable), and split with colon separators. See below for two examples: one loading imdb data, which doesn't have subtasks,

```yaml
trainfpath: imdb:train
validfpath: imdb:test
modelfpath: imdb_model
```

and one loading mnli from glue:

```yaml
trainfpath: nyu-mll/glue:mnli:train
validfpath: nyu-mll/glue:mnli:validation_matched
modelfpath: mnli_model
```
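Putting this together, a complete training config might look like the following sketch (the model choice and the `numLabels` value are illustrative; mnli has three labels, and `textLabel`/`pairLabel` are described further below):

```yaml
exp: TextClassification
mode:
    - train
models:
    hf_text_classification_model:
        - bert-base-cased
trainfpath: nyu-mll/glue:mnli:train
validfpath: nyu-mll/glue:mnli:validation_matched
modelfpath: mnli_model
textLabel: premise
pairLabel: hypothesis
numLabels: 3
```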
### loadPretrained

You can specify whether you want to load the pretrained weights of a model or randomly initialize a model with the same architecture as the named model with `loadPretrained`. If set to `False` and training for classification, you must specify the number of labels with `numLabels`. The default is `True`.

```yaml
loadPretrained: True
```
### numLabels

You can specify the number of classification labels for token or text classification with `numLabels`. Note: you must specify a value if `loadPretrained` is `False`.

```yaml
numLabels: 5
```
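For example, to train a randomly initialized binary classifier, the two settings combine as follows (a sketch; the label count is illustrative):

```yaml
loadPretrained: False
numLabels: 2
```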
### maxTrainSequenceLength

In loading a model, you can specify the maximum sequence length (i.e., the context size) with `maxTrainSequenceLength`. This changes the sequence length of the model when it is not loaded from a pretrained model, and it controls the sequence length in a batch during training. The default is 128.

```yaml
maxTrainSequenceLength: 128
```
### seed

You can specify the seed for the initial shuffling of the dataset with `seed`. The default value is 23.

```yaml
seed: 23
```
### samplePercent

It is often helpful to run training code with a subset of your data (e.g., for debugging). You can specify what percent of your data to use with `samplePercent`. This can either be a whole number between 0 and 100 (which is converted to a percent; for example, 10 translates to 0.10) or a float (e.g., 0.001). The default is `None`, which results in no sampling (i.e., all data is used).

```yaml
samplePercent: 10
```
### textLabel

For training either with language model objectives or for text/token classification, your data needs to point to the text. The column with this information is specified with `textLabel`. The default is `"text"`, which means your text data for training should be in a column with that name.

```yaml
textLabel: text
```
### pairLabel

For text classification, you can provide two sentences for tasks like natural language inference and paraphrase detection. You can specify the column with the second sentence using `pairLabel`. The default is `"pair"`. The example below shows how to specify the correct columns for glue's mnli task.

```yaml
textLabel: premise
pairLabel: hypothesis
```
### tokensLabel

For token classification, the dataset must also provide the tokens (e.g., the words). You can specify the column with this data with `tokensLabel`. The default is `"tokens"`.

```yaml
tokensLabel: tokens
```
### tagsLabel

For token classification, the dataset must also provide the per-token tags (e.g., the named-entity tags). You can specify the column with this data with `tagsLabel`. The default is `"tags"`.

```yaml
tagsLabel: tags
```
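For instance, a line in a local json training file for named entity recognition, using the default column names, might look like this (a hypothetical row for illustration only; the tokens and tag scheme are made up):

```json
{"tokens": ["John", "lives", "in", "Hamilton"], "tags": ["B-PER", "O", "O", "B-LOC"]}
```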
You can specify the following additional training settings in the config file (lightly adapted from HuggingFace's `TrainingArguments` and `Trainer` classes); a combined sketch follows this list:

- `modelfpath`: The output directory where the model is saved.
- `epochs`: Number of training epochs. The default is 2.
- `eval_strategy`: The evaluation strategy to use (`'no'` | `'steps'` | `'epoch'`). The default is `'epoch'`.
- `eval_steps`: Number of update steps between two evaluations. The default is 500.
- `batchSize`: The per-device batch size for train/eval. The default is 8.
- `learning_rate`: The initial learning rate for AdamW. The default is 5e-5.
- `weight_decay`: Weight decay. The default is 0.01.
- `save_strategy`: The checkpoint save strategy to use (`'no'` | `'steps'` | `'epoch'`). The default is `'epoch'`.
- `save_steps`: Number of update steps between saves. The default is 500.
- `load_best_model_at_end`: Whether or not to load the best model at the end of training. If `True`, the best model will be saved as the model at the end. The default is `False`.
- `maskProbability`: The rate of dynamic masking for masked language modeling. The default is 0.15.
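For example, a sketch combining several of these settings (all values illustrative):

```yaml
epochs: 3
batchSize: 16
learning_rate: 5e-5
weight_decay: 0.01
eval_strategy: epoch
save_strategy: epoch
load_best_model_at_end: True
```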
There are further parameters that can be specified, detailed below.
## device

You can specify the device you want to run the model on with `device`:

```yaml
device: mps
```

The options are `best`, `mps`, `cuda`, or `cpu` (specific devices can also be specified). The default is `best`, which prioritizes `mps` and then `cpu`. (You must specify `cuda` or similar to use GPUs.)
## precision

You can control the memory requirements of your experiments with `precision`, which controls the precision of the loaded model (if applicable):

```yaml
precision: 16bit
```

The options are `full`, `16bit`, `8bit`, and `4bit`. The default behavior is `full`, which loads the model without changing its precision. Selecting `16bit` with `train` will train a lower-precision model (note: you need to use a GPU for this).
## PLL_type

```yaml
PLL_type: original
```

For masked language models, the predictability measures are gathered by iteratively masking each token in the input. Following Kauf and Ivanova (2023), this 'pseudo-likelihood' can be determined via different masking schemes. Presently we have two options: `original`, which simply masks each token in the input one by one, following Salazar et al. (2020), and `within_word_l2r`, which handles words that are subworded by also masking each within-word subword token to the right of the token being predicted. The default is `within_word_l2r`, as Kauf and Ivanova find it performs better.
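To illustrate the difference, here is a minimal Python sketch of which positions get masked under each scheme (an illustration, not the toolkit's implementation; the tokenization shown is hypothetical):

```python
# Which positions are masked when scoring a target token?
# word_ids marks which word each subword token belongs to.
tokens   = ["the", "un", "##happy", "cat"]
word_ids = [0, 1, 1, 2]

def masked_positions(target, scheme):
    if scheme == "original":
        # Salazar et al. (2020): mask only the target token.
        return [target]
    # within_word_l2r (Kauf and Ivanova, 2023): also mask the within-word
    # subword tokens to the right of the target.
    return [i for i in range(target, len(tokens))
            if word_ids[i] == word_ids[target]]

print(masked_positions(1, "original"))         # [1]    only "un"
print(masked_positions(1, "within_word_l2r"))  # [1, 2] "un" and "##happy"
```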
## id2label

For classification models, it can be helpful to specify a mapping from the model's label indices to identifiable names. You can do this by specifying the mappings, as below:

```yaml
id2label:
    1: Positive
    0: Negative
```

This maps the output index 1 to Positive and 0 to Negative. The default behavior is to use the model's `id2label` attribute.
## tokenizers

You can specify the tokenizers you want to use with `tokenizers`:

```yaml
models:
    hf_causal_model:
        - gpt2-medium
tokenizers:
    hf_tokenizer:
        - gpt2
```

At the moment, the only supported tokenizers are those using HuggingFace's `AutoTokenizer` class. Note that the tokenizers and models are aligned lazily: the assumption is that, for each model listed in `models`, its tokenizer is in the same position in the tokenizer list. For example,

```yaml
models:
    hf_causal_model:
        - gpt2
    hf_masked_model:
        - bert-base-uncased
tokenizers:
    hf_tokenizer:
        - bert-base-uncased
        - gpt2
```

loads GPT2 with BERT's tokenizer and BERT with GPT2's tokenizer. The default behavior is to load the version of the tokenizer associated with each model, and this default is strongly encouraged.
## doLower

The input text can be lowercased with `doLower`, which takes a boolean. If set to `True`, the text is lowercased (with special tokens preserved). The default behavior is `False`. Note that this does not override the tokenizer's default behavior: for example, if you do not set `doLower` to `True` but use `bert-base-uncased`, the text will still be lowercased.

```yaml
doLower: True
```
## addPrefixSpace

Some tokenizers include white space in their tokens (e.g., byte-based tokenizers as in GPT2). For these tokenizers, the presence of an initial space affects tokenization. You can explicitly set this behavior with `addPrefixSpace`, which will add a prefix space if set to `True`. The default behavior is to not add a space (i.e., `False`), following other libraries like minicons.

```yaml
addPrefixSpace: True
```
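To see why this matters, here is a quick Python illustration with the GPT2 tokenizer (the leading `Ġ` marks a token that begins with a space):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("cat"))   # ['cat']
print(tok.tokenize(" cat"))  # ['Ġcat']  (a different token entirely)
```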
## addPadToken

For batched input, it is important to have a pad token set. This can be done with `addPadToken`. You can either put `True` or `eos_token` if you want to use the model's eos token, or give a specific token (e.g., PAD). Note that only a word with a single token id (i.e., a word that is not sub-worded) can be used as a pad token; an error will be thrown if this is not the case. Also note that this will not override the model's pad token if one already exists. The default behavior is to use eos if no pad token exists. If none is provided and there is no eos token in the model, an error may be thrown during tokenization.

```yaml
addPadToken: True
```
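Alternatively, to name a specific token (illustrative; per the note above, it must map to a single token id in the model's vocabulary):

```yaml
addPadToken: PAD
```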
## batchSize

In experiments, you can control the batch size of the model with `batchSize`. The default size is 1 in `evaluate` mode and 16 in `train` mode.

```yaml
batchSize: 1
```
## verbose

When running experiments, you can control verbosity with `verbose`. When set to `True`, more information is printed to the screen. The default behavior is `True`.

```yaml
verbose: True
```
Please cite our NLP Scholar paper:

```bibtex
@inproceedings{prasad-davis-2024-training-nlp,
    title = {Training an {NLP} Scholar at a Small Liberal Arts College: A Backwards Designed Course Proposal},
    author = {Prasad, Grusha and Davis, Forrest},
    editor = {Al-azzawi, Sana and Biester, Laura and Kov{\'a}cs, Gy{\"o}rgy and Marasovi{\'c}, Ana and Mathur, Leena and Mieskes, Margot and Weissweiler, Leonie},
    booktitle = {Proceedings of the Sixth Workshop on Teaching NLP},
    month = aug,
    year = {2024},
    address = {Bangkok, Thailand},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/2024.teachingnlp-1.16},
    pages = {105--118},
}
```
This wouldn't work without HuggingFace! Please cite them too:

```bibtex
@inproceedings{wolf-etal-2020-transformers,
    title = "Transformers: State-of-the-Art Natural Language Processing",
    author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = oct,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
    pages = "38--45",
}
```
```bibtex
@inproceedings{lhoest-etal-2021-datasets,
    title = "Datasets: A Community Library for Natural Language Processing",
    author = "Lhoest, Quentin and
      Villanova del Moral, Albert and
      Jernite, Yacine and
      Thakur, Abhishek and
      von Platen, Patrick and
      Patil, Suraj and
      Chaumond, Julien and
      Drame, Mariama and
      Plu, Julien and
      Tunstall, Lewis and
      Davison, Joe and
      {\v{S}}a{\v{s}}ko, Mario and
      Chhablani, Gunjan and
      Malik, Bhavitvya and
      Brandeis, Simon and
      Le Scao, Teven and
      Sanh, Victor and
      Xu, Canwen and
      Patry, Nicolas and
      McMillan-Major, Angelina and
      Schmid, Philipp and
      Gugger, Sylvain and
      Delangue, Cl{\'e}ment and
      Matussi{\`e}re, Th{\'e}o and
      Debut, Lysandre and
      Bekman, Stas and
      Cistac, Pierric and
      Goehringer, Thibault and
      Mustar, Victor and
      Lagunas, Fran{\c{c}}ois and
      Rush, Alexander and
      Wolf, Thomas",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-demo.21",
    pages = "175--184",
    eprint = {2109.02846},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL},
}
```
If you use Kauf and Ivanova's PLL scoring technique, please cite them:

```bibtex
@inproceedings{kauf-ivanova-2023-better,
    title = "A Better Way to Do Masked Language Model Scoring",
    author = "Kauf, Carina and
      Ivanova, Anna",
    editor = "Rogers, Anna and
      Boyd-Graber, Jordan and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-short.80",
    doi = "10.18653/v1/2023.acl-short.80",
    pages = "925--935",
}
```
If you like this toolkit and want more tools or different tools, check out the following repositories: