Implements the model described in the paper Entity, Relation, and Event Extraction with Contextualized Span Representations.
See the doc folder for documentation with more details on the data, model implementation and debugging, and model configuration.
October 2023: Unfortunately, AllenNLP (on which DyGIE++ is built) has been archived and is not actively maintained. Due to changes to various software packages and the unavailability of older versions, following the instructions under dependencies now raises errors when trying to install DyGIE++. I don't have bandwidth to get things updated. I'd welcome a PR to update the relevant dependencies and get things working again! See the dependencies section for more info.
December 2021: A couple nice additions thanks to PR's from contributors:
April 2021: We've added data and models for the MECHANIC dataset, presented in the NAACL 2021 paper Extracting a Knowledge Base of Mechanisms from COVID-19 Papers.
You can also get the data by running bash scripts/data/get_mechanic.sh, which will put the data in data/mechanic.
After moving the models to the pretrained folder, you can make predictions like this:
allennlp predict \
pretrained/mechanic-coarse.tar.gz \
data/mechanic/coarse/test.json \
--predictor dygie \
--include-package dygie \
--use-dataset-reader \
--output-file predictions/covid-coarse.jsonl \
--cuda-device 0 \
--silent
This branch used to be named allennlp-v1, and it has been made the new master. It's compatible with the new version of AllenNLP, and the model configuration process has been simplified. I'd recommend using this branch for all future work. If for some reason you need the older version of the code, it's on the branch emnlp-2019.
Unfortunately, I don't have the bandwidth at this point to add additional features. But please create a new issue if you have problems with:
See below for guidelines on creating an issue.
There are a number of ways this code could be improved, and I'd definitely welcome pull requests. If you're interested, see contributions.md for a list of ideas.
If you have a DyGIE model that you've trained on a new dataset, feel free to upload it here and I'll add it to the collection of pre-trained models.
If you're unable to run the code, feel free to create an issue. Please do the following:
Specify any commands you used to download pretrained models or to download / preprocess data. Please enclose the code in code blocks, for instance:
# Download pretrained models.
bash scripts/pretrained/get_dygiepp_pretrained.sh
allennlp evaluate \
pretrained/scierc.tar.gz \
data/scierc/normalized_data/json/test.json \
--cuda-device 2 \
--include-package dygie
.jsonl file.
Update (October 2023): These directions no longer work. Python 3.7 is no longer available from conda, and AllenNLP is no longer actively maintained, causing some dependencies to break. I'd welcome a PR to get things working again.
Clone this repository and navigate to the root of the repo on your system. Then execute:
conda create --name dygiepp python=3.7
conda activate dygiepp
pip install -r requirements.txt
conda develop . # Adds DyGIE to your PYTHONPATH
This library relies on AllenNLP and uses AllenNLP shell commands to kick off training, evaluation, and testing.
If you run into an issue installing jsonnet, this issue may prove helpful.
A Dockerfile is provided with the Pytorch + CUDA + CUDNN base image for a full-stack GPU install.
It will create conda environments dygiepp for modeling & ace-event-preprocess for ACE05-Event preprocessing.
By default the build downloads datasets and dependencies for all tasks. This takes a long time and produces a large image, so you will want to comment out unneeded datasets/tasks in the Dockerfile.
Dockerfile.
docker build --tag dygiepp:dev <dygiepp-repo-dirpath>
docker run --gpus all -it --ipc=host -v <dygiepp-repo-dirpath>:/dygiepp/ --name dygiepp dygiepp:dev
NOTE: This Dockerfile was added in a PR from a contributor. I haven't tested it, so it's not "officially supported". More PR's are welcome, though.
Warning about coreference resolution: The coreference code will break on sentences with only a single token. If you have these in your dataset, either get rid of them or deactivate the coreference resolution part of the model.
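A quick way to check whether a dataset contains such sentences is sketched below. It is not part of the repository: the function name find_single_token_sentences is hypothetical, and the sketch assumes that each line of the input is a JSON document with a doc_key and a sentences field holding a list of tokenized sentences (see doc/data.md for the actual format).

```python
import json


def find_single_token_sentences(jsonl_lines):
    """Return (doc_key, sentence_index) pairs for sentences with exactly one token.

    Assumption: each line is a JSON object with a `sentences` field that is a
    list of token lists. `doc_key` is used only for reporting.
    """
    hits = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        doc = json.loads(line)
        for i, sentence in enumerate(doc.get("sentences", [])):
            if len(sentence) == 1:
                hits.append((doc.get("doc_key"), i))
    return hits


if __name__ == "__main__":
    # Made-up documents, for illustration only.
    example = [
        json.dumps({"doc_key": "doc1", "sentences": [["Hello", "world"], ["Ok"]]}),
        json.dumps({"doc_key": "doc2", "sentences": [["No", "problems", "here"]]}),
    ]
    print(find_single_token_sentences(example))
```

The sketch only reports offending sentences. If annotations use document-level token offsets, deleting a sentence would also require shifting every offset that follows it, which is why no automatic removal is attempted here.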
We rely on allennlp train to handle model training. The train command takes a configuration file as an argument, initializes a model based on the configuration, and serializes the trained model. More details on the configuration process for DyGIE can be found in doc/config.md.
To train a model, enter bash scripts/train.sh [config_name] at the command line, where config_name is the name of a file in the training_config directory. For instance, to train a model using the scierc.jsonnet config, you'd enter
bash scripts/train.sh scierc
The resulting model will go in models/scierc. For more information on how to modify training configs (e.g. to change the GPU used for training), see config.md.
Information on preparing specific training datasets is below. For more information on how to create training batches that utilize GPU resources efficiently, see model.md. Hyperparameter search is implemented using Optuna; see model.md.
To train a model for named entity recognition, relation extraction, and coreference resolution on the SciERC dataset:
Download the data: bash ./scripts/data/get_scierc.sh. This will download the scierc dataset into a folder ./data/scierc.
Train the model: bash scripts/train.sh scierc.
To train the lightweight version, run bash scripts/train.sh scierc_lightweight instead. More info on why you'd want to do this is in the section on making predictions.
For GENIA, the steps are similar to SciERC.
Download the data: bash ./scripts/data/get_genia.sh.
Train the model: bash scripts/train.sh genia.
The ChemProt corpus contains entity and relation annotations for drug / protein interaction. The ChemProt preprocessing requires a separate environment:
conda deactivate
conda create --name chemprot-preprocess python=3.7
conda activate chemprot-preprocess
pip install -r scripts/data/chemprot/requirements.txt
Then, follow these steps:
Get the data.
Run bash ./scripts/data/get_chemprot.sh. This will download the data and process it into the DyGIE input format.
mkdir -p data/chemprot/collated_data
python scripts/data/shared/collate.py \
data/chemprot/processed_data \
data/chemprot/collated_data \
--train_name=training \
--dev_name=development
python scripts/data/chemprot/03_spot_check.py
Train the model: bash scripts/train.sh chemprot.
For more information on ACE relation and event preprocessing, see doc/data.md and this issue.
We use preprocessing code adapted from the DyGIE repo, which is in turn adapted from the LSTM-ER repo. The following software is required:
First, we need to download Stanford CoreNLP:
bash scripts/data/ace05/get_corenlp.sh
Then, run the driver script to preprocess the data:
bash scripts/data/get_ace05.sh [path-to-ACE-data]
The results will go in ./data/ace05/collated-data. The intermediate files will go in ./data/ace05/raw-data.
Enter bash scripts/train.sh ace05_relation. A model trained this way will not reproduce the numbers in the paper. We're in the process of debugging and will update.
The preprocessing code I wrote breaks with the newest version of Spacy. So unfortunately, we need to create a separate virtualenv that uses an old version of Spacy and use that for preprocessing.
conda deactivate
conda create --name ace-event-preprocess python=3.7
conda activate ace-event-preprocess
pip install -r scripts/data/ace-event/requirements.txt
python -m spacy download en_core_web_sm
Then, collect the relevant files from the ACE data distribution with
bash ./scripts/data/ace-event/collect_ace_event.sh [path-to-ACE-data]
The results will go in ./data/ace-event/raw-data.
Now, run the script
python ./scripts/data/ace-event/parse_ace_event.py [output-name] [optional-flags]
You can see the available flags by calling parse_ace_event.py -h. For detailed descriptions, see data.md. The results will go in ./data/ace-event/processed-data/[output-name]. We require an output name because you may want to preprocess the ACE data multiple times using different flags. For default preprocessing settings, you could do:
python ./scripts/data/ace-event/parse_ace_event.py default-settings
Now conda deactivate the ace-event-preprocess environment and re-activate your modeling environment.
Finally, collate the version of the dataset you just created. For instance, continuing the example above,
mkdir -p data/ace-event/collated-data/default-settings/json
python scripts/data/shared/collate.py \
data/ace-event/processed-data/default-settings/json \
data/ace-event/collated-data/default-settings/json \
--file_extension json
To train on the data preprocessed with default settings, enter bash scripts/train.sh ace05_event. A model trained in this fashion will reproduce (within 0.1 F1 or so) the results in Table 4 of the paper. To train on a different version, modify training_config/ace05_event.jsonnet to point to the appropriate files.
Reproducing the results in Table 1 requires training an ensemble of 4 trigger detectors. The basic process is as follows:
training_config/ace05_event.jsonnet by setting
model +: {
  modules +: {
    events +: {
      loss_weights: {
        trigger: 1.0,
        arguments: 0.5
      }
    }
  }
}
models/ace05_event.
If you need more details, email me.
You can get the dataset by running bash scripts/data/get_mechanic.sh. For detailed training instructions, see the DyGIE-COFIE repo.
To check the performance of one of your models or a pretrained model, you can use the allennlp evaluate command.
Note that allennlp commands will only be able to discover the code in this package if you run the commands from the root folder of this project, dygiepp, or if you have run conda develop . from the root folder of this project. Otherwise, you will get an error ModuleNotFoundError: No module named 'dygie'.
In general, you can evaluate a model like this:
allennlp evaluate \
[model-file] \
[data-path] \
--cuda-device [cuda-device] \
--include-package dygie \
--output-file [output-file] # Optional; if not given, prints metrics to console.
For example, to evaluate the pretrained SciERC model, you could do
allennlp evaluate \
pretrained/scierc.tar.gz \
data/scierc/normalized_data/json/test.json \
--cuda-device 2 \
--include-package dygie
To evaluate a model you trained on the SciERC data, you could do
allennlp evaluate \
models/scierc/model.tar.gz \
data/scierc/normalized_data/json/test.json \
--cuda-device 2 \
--include-package dygie \
--output-file models/scierc/metrics_test.json
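To pull the headline numbers out of a metrics file like the one written by the command above, something along these lines may help. This is a sketch rather than part of the repository: the helper name f1_scores is hypothetical, and it assumes the file is a flat JSON object whose F1 entries have keys ending in _f1, as in the performance listings in this README.

```python
import json


def f1_scores(metrics):
    """Return the entries of a metrics dict whose keys end in `_f1`, sorted by key."""
    return {k: v for k, v in sorted(metrics.items()) if k.endswith("_f1")}


if __name__ == "__main__":
    # Stand-in for json.load(open("models/scierc/metrics_test.json")).
    # The values below are made up for illustration.
    metrics = json.loads(
        '{"_scierc__relation_f1": 0.25, "_scierc__ner_f1": 0.5, "loss": 1.0}'
    )
    for name, value in f1_scores(metrics).items():
        print(f"{name}: {value:.4f}")
```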
A number of models are available for download. They are named for the dataset they are trained on. "Lightweight" models are models trained on datasets for which coreference resolution annotations were available, but we didn't use them. This is "lightweight" because coreference resolution is expensive, since it requires predicting cross-sentence relationships between spans.
If you want to use one of these pretrained models to make predictions on a new dataset, you need to set the dataset field for the instances in your new dataset to match the name of the dataset the model was trained on. For example, to make predictions using the pretrained SciERC model, set the dataset field in your new instances to scierc. For more information on the dataset field, see data.md.
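One way to set the dataset field across a whole file is sketched below. It is not a script from this repository: the function name set_dataset_field is hypothetical, and the sketch assumes the input is in JSON-lines form with one document per line.

```python
import json


def set_dataset_field(jsonl_lines, dataset_name):
    """Yield each JSON line with its `dataset` field set to `dataset_name`.

    Assumption: every non-blank line is a JSON object describing one document.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        doc = json.loads(line)
        doc["dataset"] = dataset_name
        yield json.dumps(doc)


if __name__ == "__main__":
    # Made-up document, for illustration only.
    docs = ['{"doc_key": "my_doc", "sentences": [["A", "sentence", "."]]}']
    for out in set_dataset_field(docs, "scierc"):
        print(out)
```

To process a file, pass an open file handle as jsonl_lines and write each yielded string to the output file followed by a newline.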
To download all available models, run scripts/pretrained/get_dygiepp_pretrained.sh. Or, click on the links below to download only a single model.
Below are links to the available models, followed by the name of the dataset the model was trained on.
scierc
scierc
genia
genia
chemprot
ace05
ace-event
None
covid-event
DyGIE can now be called from Spacy! For example usage, see the demo notebook. This feature was added by a contributor; please tag @e3oroush on related issues.
SciERC
"_scierc__ner_f1": 0.6846741045214326,
"_scierc__relation_f1": 0.46236559139784944
SciERC lightweight
"_scierc__ner_f1": 0.6717245404143566,
"_scierc__relation_f1": 0.4670588235294118
GENIA
"_genia__ner_f1": 0.7713070807912737
GENIA lightweight
"_genia__ner_f1": 0.7690401296349251
ChemProt
"_chemprot__ner_f1": 0.9059113300492612,
"_chemprot__relation_f1": 0.5404867256637169
Note that we're doing span-level evaluation using predicted entities. We're also evaluating on all ChemProt relation classes, while the official task only evaluates on a subset (see Liu et al. for details). Thus, our relation extraction performance is lower than, for instance, Verga et al., where they use gold entities as inputs for relation prediction.
ACE05-Relation
"_ace05__ner_f1": 0.8634611855386309,
"_ace05__relation_f1": 0.6484907497565725,
ACE05-Event
"_ace-event__ner_f1": 0.8927209418006965,
"_ace-event_trig_class_f1": 0.6998813760379595,
"_ace-event_arg_class_f1": 0.5,
"_ace-event__relation_f1": 0.5514950166112956
To make a prediction, you can use allennlp predict. For example, to make a prediction with the pretrained scierc model, you can do:
allennlp predict pretrained/scierc.tar.gz \
data/scierc/normalized_data/json/test.json \
--predictor dygie \
--include-package dygie \
--use-dataset-reader \
--output-file predictions/scierc-test.jsonl \
--cuda-device 0 \
--silent
The predictions include the predicted labels, as well as logits and softmax scores. For more information, see doc/data.md.
Caveat: Models trained to predict coreference clusters need to make predictions on a whole document at once. This can cause memory issues. To get around this there are two options:
See the docs for more prediction options.
Following Li and Ji (2014), we consider a predicted relation to be correct if "its relation type is correct, and the head offsets of two entity mention arguments are both correct".
In particular, we do not require the types of the entity mention arguments to be correct, as is done in some work (e.g. Zhang et al. (2017)). We welcome a pull request that implements this alternative evaluation metric. Please open an issue if you're interested in this.
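The matching criterion can be illustrated with a short sketch. This is not the evaluation code used in this repository: the function name relation_prf and the tuple layout are assumptions made for the example, with (start, end) spans standing in for the argument offsets and made-up relation labels.

```python
def relation_prf(gold, predicted):
    """Compute precision, recall, and F1 over relation tuples.

    Each relation is (arg1_span, arg2_span, relation_type), where a span is a
    (start, end) pair of offsets. Entity types are deliberately not part of
    the tuple, so they play no role in deciding whether a prediction matches.
    """
    gold_set, pred_set = set(gold), set(predicted)
    n_correct = len(gold_set & pred_set)
    precision = n_correct / len(pred_set) if pred_set else 0.0
    recall = n_correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up relations for a single document.
    gold = [((0, 1), (4, 5), "USED-FOR"), ((7, 7), (9, 10), "PART-OF")]
    pred = [((0, 1), (4, 5), "USED-FOR"), ((7, 7), (9, 10), "USED-FOR")]
    # The second prediction has the right spans but the wrong relation type,
    # so only one of the two predictions counts as correct.
    print(relation_prf(gold, pred))
```

The stricter alternative mentioned above would add the two entity types to each tuple, so that a prediction matches only when those agree as well.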
Follow the instructions as described in Formatting a new dataset.
To make predictions on a new, unlabeled dataset:
Make sure that the dataset field for your new dataset matches the label namespaces for the pretrained model. See here for more on label namespaces. To view the available label namespaces for a pretrained model, use print_label_namespaces.py.
allennlp predict pretrained/[name-of-pretrained-model].tar.gz \
[input-path] \
--predictor dygie \
--include-package dygie \
--use-dataset-reader \
--output-file [output-path] \
--cuda-device [cuda-device]
A couple tricks to make things run smoothly:
--overrides "{'dataset_reader' +: {'lazy': true}}"
jsonl output, but they will have an additional field {"_FAILED_PREDICTION": true} indicating that the model ran out of memory on this example.
The dataset field in the dataset to be predicted must match one of the datasets on which the model was trained; otherwise, the model won't know which labels to apply to the predicted data.
Follow the process described in Training a model, but adjust the input and output file paths as appropriate.
For questions or problems with the code, create a GitHub issue (preferred) or email dwadden@cs.washington.edu.