State-of-the-art Information Extraction
Tested with Python 3.8.
Download and install Sherlock:
git clone git@github.com:DFKI-NLP/sherlock.git
cd sherlock
pip install .
The most straightforward way to use and test a model is via AllenNLP:
To train, use the AllenNLP CLI. This requires you to set up a config file. This project includes two example configurations in the configs folder:
You can train the models via:
# transformer
allennlp train configs/binary_rc/transformer.jsonnet -f -s <serialization dir>
# cnn
allennlp train configs/binary_rc/cnn.jsonnet -f -s <serialization dir>
To evaluate a model, you need a model.tar.gz file in AllenNLP's archive format. You then have two options:
# evaluation script
python ./scripts/eval_binary_relation_clf_allennlp.py \
--eval_data_path <PATH TO EVAL DATA> \
--test_data_path <PATH TO TEST DATA> \
--do_eval \
--do_predict \
--eval_all_checkpoints \
--per_gpu_batch_size 8 \
--output_dir <SERIALIZATION DIR or PATH TO ARCHIVE> \
--overwrite_results
# allennlp cli
allennlp evaluate <PATH TO ARCHIVE> <PATH TO EVAL DATA> \
--cuda-device 0 \
--batch-size 8
The crux of the configs lies in the dataset_reader and model sections.
The dataset_reader for AllenNLP combines Sherlock's dataset_reader and feature_converter.
It inherits from allennlp.data.DatasetReader and is registered under the name ("type") "sherlock". It accepts a dataset_reader_name, which must be a registered Sherlock dataset_reader, and dataset_reader_kwargs to initialize that dataset_reader with the correct arguments.
The same applies to the feature_converter. Besides that, it takes the standard arguments of an AllenNLP DatasetReader.
For more details, see the documentation of the sherlock_dataset_reader.
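For illustration, a hedged sketch of what such a dataset_reader section might look like; the reader and converter names and the exact keyword names are assumptions, so check configs/binary_rc/transformer.jsonnet for the real values:
{
  // "type" selects the Sherlock wrapper reader described above
  "dataset_reader": {
    "type": "sherlock",
    // name of a registered sherlock dataset_reader (assumed value)
    "dataset_reader_name": "tacred",
    // kwargs forwarded to that dataset_reader
    "dataset_reader_kwargs": {},
    // name of a registered sherlock feature_converter (assumed value)
    "feature_converter_name": "binary_rc",
    // kwargs forwarded to that feature_converter
    "feature_converter_kwargs": {},
    // a standard AllenNLP DatasetReader argument
    "max_instances": 1000
  }
}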
The models directory contains the models that can be used as of now. Thanks to dependency injection you can already produce quite a lot with these models: while the transformer model is limited to a certain type of (BERT-like) transformers, the basic_relation_classifier can handle anything that fits the schema "embedder" -> "encoder" -> "classifier" (yes, theoretically transformer-based models too).
For the transformer model it is important to pass the correct tokenizer keyword arguments, in this case additional_special_tokens, because it uses those to resize its embedding matrix. There did not seem to be another generic and clean way to do this.
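As a hedged illustration of that embedder -> encoder -> classifier schema, a model section for the CNN variant might look roughly like this; the registered model name and its argument names are assumptions, while the embedder and encoder entries are standard AllenNLP components (see configs/binary_rc/cnn.jsonnet for the actual configuration):
{
  "model": {
    // registered model name assumed for illustration
    "type": "basic_relation_classifier",
    // "embedder": any AllenNLP TextFieldEmbedder
    "text_field_embedder": {
      "token_embedders": {
        "tokens": { "type": "embedding", "embedding_dim": 300 }
      }
    },
    // "encoder": any AllenNLP Seq2VecEncoder, here the standard CNN encoder
    "seq2vec_encoder": {
      "type": "cnn",
      "embedding_dim": 300,
      "num_filters": 100,
      "ngram_filter_sizes": [2, 3, 4, 5]
    }
    // the classification head on top is provided by the model itself
  }
}
For the transformer variant, the additional_special_tokens mentioned above would be passed through the tokenizer keyword arguments in the config instead.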
The original repo was written with support only for the transformers library.
Although it is possible to use transformers models via AllenNLP, Sherlock v2 still supports the older codebase:
For example, to train an NER model on the TACRED dataset:
./scripts/run_ner.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--do_predict \
--evaluate_during_training \
--eval_all_checkpoints \
--do_lower_case \
--data_dir <TACRED DIR> \
--save_steps 8500 \
--logging_steps 8500 \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--learning_rate 2e-5 \
--num_train_epochs 5.0 \
--overwrite_cache \
--overwrite_output_dir \
--output_dir <OUTPUT DIR> \
--cache_dir <CACHE DIR>
For example, to train an RC model on the TACRED dataset:
./scripts/run_binary_relation_clf.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--do_train \
--do_eval \
--do_predict \
--evaluate_during_training \
--eval_all_checkpoints \
--do_lower_case \
--data_dir <TACRED DIR> \
--save_steps 8500 \
--logging_steps 8500 \
--max_seq_length 128 \
--per_gpu_eval_batch_size=8 \
--per_gpu_train_batch_size=8 \
--learning_rate 2e-5 \
--num_train_epochs 5.0 \
--overwrite_cache \
--overwrite_output_dir \
--entity_handling mark_entity_append_ner \
--output_dir <OUTPUT DIR> \
--cache_dir <CACHE DIR>
Tests are located in the tests directory. To run them, call from the root directory:
py.test
or
pytest -sv
To run a specific test, specify the test file and use the -k flag:
pytest tests/feature_converters/token_classification_test.py -sv -k "truncate"
With python==3.9, the installation of tokenizers (needed for transformers) may fail. Install Rust manually:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
(link1, link2)
With conda>=4.10, the installation of jsonnet may fail. Install it manually:
conda install -c conda-forge jsonnet