KnowBert is a general method to embed multiple knowledge bases into BERT. This repository contains pretrained models, evaluation and training scripts for KnowBert with Wikipedia and WordNet.
Citation:
@inproceedings{Peters2019KnowledgeEC,
author={Matthew E. Peters and Mark Neumann and Robert L Logan and Roy Schwartz and Vidur Joshi and Sameer Singh and Noah A. Smith},
title={Knowledge Enhanced Contextual Word Representations},
booktitle={EMNLP},
year={2019}
}
git clone git@github.com:allenai/kb.git
cd kb
conda create -n knowbert python=3.6.7
source activate knowbert
pip install torch==1.2.0
pip install -r requirements.txt
python -c "import nltk; nltk.download('wordnet')"
python -m spacy download en_core_web_sm
pip install --editable .
Then make sure the tests pass:
pytest -v tests
from kb.include_all import ModelArchiveFromParams
from kb.knowbert_utils import KnowBertBatchifier
from allennlp.common import Params
import torch
# a pretrained model, e.g. for Wordnet+Wikipedia
archive_file = 'https://allennlp.s3-us-west-2.amazonaws.com/knowbert/models/knowbert_wiki_wordnet_model.tar.gz'
# load model and batcher
params = Params({"archive_file": archive_file})
model = ModelArchiveFromParams.from_params(params=params)
batcher = KnowBertBatchifier(archive_file)
sentences = ["Paris is located in France.", "KnowBert is a knowledge enhanced BERT"]
# batcher takes raw untokenized sentences
# and yields batches of tensors needed to run KnowBert
for batch in batcher.iter_batches(sentences, verbose=True):
    # model_output['contextual_embeddings'] is (batch_size, seq_len, embed_dim) tensor of top layer activations
    model_output = model(**batch)
First download one of the pretrained models from the previous section.
Download the heldout data. Then run:
MODEL_ARCHIVE=..location of model
HELDOUT_FILE=wikipedia_bookscorpus_knowbert_heldout.txt
python bin/evaluate_perplexity.py -m $MODEL_ARCHIVE -e $HELDOUT_FILE
The heldout perplexity is key exp(lm_loss_wgt).
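As a minimal sketch, assuming the metrics printed by the evaluation script have been parsed into a Python dictionary, the perplexity could be read off as follows. The helper name `heldout_perplexity` and the `metrics` dictionary are illustrative and not part of the repository. The sketch accepts either a value already stored under `exp(lm_loss_wgt)` or a raw `lm_loss_wgt` loss that still needs to be exponentiated.

```python
import math


def heldout_perplexity(metrics):
    """Return the heldout perplexity from a metrics dict (hypothetical helper)."""
    # If the perplexity is already reported under this literal key, use it.
    if "exp(lm_loss_wgt)" in metrics:
        return float(metrics["exp(lm_loss_wgt)"])
    # Otherwise exponentiate the weighted language-model loss.
    return math.exp(metrics["lm_loss_wgt"])


# Made-up loss value, for illustration only.
example_metrics = {"lm_loss_wgt": 1.25}
print(round(heldout_perplexity(example_metrics), 4))
```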
Run:
MODEL_ARCHIVE=..location of model
mkdir -p kg_probe
cd kg_probe
curl https://allennlp.s3-us-west-2.amazonaws.com/knowbert/data/kg_probe.zip > kg_probe.zip
unzip kg_probe.zip
cd ..
python bin/evaluate_mrr.py \
--model_archive $MODEL_ARCHIVE \
--datadir kg_probe \
--cuda_device 0
The results are in key 'mrr'.
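For readers unfamiliar with the metric, the snippet below sketches the standard definition of mean reciprocal rank (MRR): the average of 1/rank of the correct answer over all queries. It is not taken from bin/evaluate_mrr.py; the function name and the example ranks are made up for illustration.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where each rank is the 1-based position of the correct answer."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Illustrative ranks for three queries: correct answer ranked 1st, 2nd and 4th.
example_ranks = [1, 2, 4]
print(mean_reciprocal_rank(example_ranks))
```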
To evaluate the internal WordNet linker on the ALL task evaluation from Raganato et al. (2017) follow these steps (Table 2). First download the Java scorer and evaluation file.
Then run this command to generate predictions from KnowBert:
EVALUATION_FILE=semeval2007_semeval2013_semeval2015_senseval2_senseval3_all.json
KNOWBERT_PREDICTIONS=knowbert_wordnet_predicted.txt
MODEL_ARCHIVE=..location of model
python bin/evaluate_wsd_official.py \
--evaluation_file $EVALUATION_FILE \
--output_file $KNOWBERT_PREDICTIONS \
--model_archive $MODEL_ARCHIVE \
--cuda_device 0
To evaluate predictions, decompress the Java scorer, navigate to the directory WSD_Evaluation_Framework/Evaluation_Datasets and run:
java Scorer ALL/ALL.gold.key.txt $KNOWBERT_PREDICTIONS
To reproduce the results in Table 3 for KnowBert-W+W:
# or aida_test.txt
EVALUATION_FILE=aida_dev.txt
MODEL_ARCHIVE=..location of model
curl https://allennlp.s3-us-west-2.amazonaws.com/knowbert/wiki_entity_linking/$EVALUATION_FILE > $EVALUATION_FILE
python bin/evaluate_wiki_linking.py \
--model_archive $MODEL_ARCHIVE \
--evaluation_file $EVALUATION_FILE \
--wiki_and_wordnet
Results are in key wiki_el_f1.
Fine tuning KnowBert is similar to fine tuning BERT for a downstream task. We provide configuration and model files for the following tasks:
To reproduce our results for the following tasks, find the appropriate config file in training_config/downstream/, edit the location of the training and dev data files, then run (example provided for TACRED):
allennlp train --file-friendly-logging --include-package kb.include_all \
training_config/downstream/tacred.jsonnet -s OUTPUT_DIRECTORY
Similar to BERT, for some tasks performance can vary significantly with hyperparameter
choices and the random seed. We used the script bin/run_hyperparameter_seeds.sh
to perform a small grid search over learning rate, number of epochs and the random seed,
choosing the best model based on the validation set.
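The search described above can be pictured as enumerating every combination of learning rate, epoch count and seed, and launching one training run per combination. The sketch below is not the contents of bin/run_hyperparameter_seeds.sh. The search values are placeholders, and the override key names (`random_seed`, `trainer.num_epochs`, `trainer.optimizer.lr`) are assumptions that should be checked against the chosen config file. It only prints the commands rather than running them.

```python
import itertools
import json

# Hypothetical search space; placeholder values, not those used in the paper.
learning_rates = [2e-5, 3e-5, 5e-5]
num_epochs = [3, 4]
seeds = [1, 2, 3]


def build_overrides(lr, epochs, seed):
    """Build a JSON string for the --overrides flag (assumed key names)."""
    return json.dumps({
        "random_seed": seed,
        "trainer": {"num_epochs": epochs, "optimizer": {"lr": lr}},
    })


# One entry per (learning rate, epochs, seed) combination.
grid = list(itertools.product(learning_rates, num_epochs, seeds))

for lr, epochs, seed in grid:
    overrides = build_overrides(lr, epochs, seed)
    out_dir = f"OUTPUT_DIRECTORY/lr{lr}_ep{epochs}_seed{seed}"
    print(
        "allennlp train --file-friendly-logging --include-package kb.include_all "
        f"training_config/downstream/tacred.jsonnet -s {out_dir} "
        f"--overrides '{overrides}'"
    )
```

After all runs finish, the model with the best validation score would be selected by comparing the metrics written to each output directory.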
Fine-tuned KnowBert-Wiki+Wordnet models are available.
To evaluate a model first download the model archive and run:
allennlp evaluate --include-package kb.include_all \
--cuda-device 0 \
model_archive_here \
dev_or_test_filename_here
To evaluate a TACRED model with the official scorer, run:
python bin/write_tacred_for_official_scorer.py \
--model_archive model_archive_here \
--evaluation_file tacred_dev_or_test.json \
--output_file knowbert_predictions_tacred_dev_or_test.txt
python bin/tacred_scorer.py tacred_dev_or_test.gold knowbert_predictions_tacred_dev_or_test.txt
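As background on what the score means, TACRED results are conventionally reported as micro-averaged precision, recall and F1 over all relations except `no_relation`. The sketch below illustrates that computation on in-memory label lists. It is not a substitute for bin/tacred_scorer.py, its handling of edge cases (such as no predicted relations) may differ from the official scorer, and the example labels are made up.

```python
NO_RELATION = "no_relation"


def tacred_style_micro_f1(gold, predicted):
    """Micro precision, recall and F1, ignoring the no_relation class."""
    correct = guessed = gold_count = 0
    for gold_label, predicted_label in zip(gold, predicted):
        if predicted_label != NO_RELATION:
            guessed += 1
            if predicted_label == gold_label:
                correct += 1
        if gold_label != NO_RELATION:
            gold_count += 1
    precision = correct / guessed if guessed else 0.0
    recall = correct / gold_count if gold_count else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative labels only.
gold = ["per:title", "no_relation", "org:founded_by", "per:title"]
predicted = ["per:title", "per:title", "no_relation", "org:founded_by"]
print(tacred_style_micro_f1(gold, predicted))
```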
To evaluate a SemEval 2010 Task 8 model with the official scorer, first download the testing gold keys and run:
curl https://allennlp.s3-us-west-2.amazonaws.com/knowbert/data/semeval2010_task8/test.json > semeval2010_task8_test.json
python bin/write_semeval2010_task8_for_official_eval.py \
--model_archive model_archive_here \
--evaluation_file semeval2010_task8_test.json \
--output_file knowbert_predictions_semeval2010_task8_test.txt
perl -w bin/semeval2010_task8_scorer-v1.2.pl knowbert_predictions_semeval2010_task8_test.txt semeval2010_task8_testing_keys.txt
For WiC, use bin/write_wic_for_codalab.py to write a file for submission to the CodaLab evaluation server.
Roughly speaking, the process to fine tune BERT into KnowBert is:
Use bin/create_pretraining_data_for_bert.py to group the sentences by length, do the NSP sampling, and write out files for training.

We have already prepared the knowledge bases for Wikipedia and WordNet. The necessary files will be automatically downloaded as needed when running evaluations or fine tuning KnowBert.
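To make the pretraining data step more concrete, the sketch below shows one simple way that next sentence prediction (NSP) sampling and grouping by length can be done. It is an illustration under stated assumptions, not the logic of bin/create_pretraining_data_for_bert.py: the 50% negative sampling rate, whitespace tokenization, the bucket scheme and all function names are hypothetical.

```python
import random


def make_nsp_examples(documents, rng):
    """documents: list of documents, each a list of sentence strings."""
    examples = []
    for doc_index, doc in enumerate(documents):
        other_docs = [d for i, d in enumerate(documents) if i != doc_index and d]
        for first, second in zip(doc, doc[1:]):
            if other_docs and rng.random() < 0.5:
                # Negative example: pair with a random sentence from another document.
                second = rng.choice(rng.choice(other_docs))
                is_next = False
            else:
                # Positive example: keep the true following sentence.
                is_next = True
            examples.append(
                {"sentence_a": first, "sentence_b": second, "is_next": is_next}
            )
    return examples


def group_by_length(examples, num_buckets):
    """Sort by whitespace token count and split into roughly equal buckets."""
    def length(example):
        return len(example["sentence_a"].split()) + len(example["sentence_b"].split())

    ordered = sorted(examples, key=length)
    bucket_size = max(1, -(-len(ordered) // num_buckets))  # ceiling division
    return [ordered[i:i + bucket_size] for i in range(0, len(ordered), bucket_size)]


documents = [
    ["Paris is located in France.", "It is the capital.", "The Seine flows through it."],
    ["KnowBert is a knowledge enhanced BERT.", "It uses WordNet and Wikipedia."],
]
examples = make_nsp_examples(documents, random.Random(0))
buckets = group_by_length(examples, num_buckets=2)
for bucket in buckets:
    print(len(bucket), "examples in bucket")
```

Grouping examples of similar length reduces the amount of padding in each training batch.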
If you would like to add an additional knowledge source to KnowBert, these are roughly the steps to follow:
Our Wikipedia candidate dictionary list and embeddings were extracted from End-to-End Neural Entity Linking, Kolitsas et al 2018 via a manual process.
Our WordNet candidate generator is rule based (see code). The embeddings were computed via a multistep process that combines TuckER and GenSen embeddings. The prepared files contain everything needed to run KnowBert and include:
entities.jsonl - metadata about WordNet synsets.
wordnet_synsets_mask_null_vocab.txt and wordnet_synsets_mask_null_vocab_embeddings_tucker_gensen.hdf5 - vocabulary file and embedding file for WordNet synsets.
semcor_and_wordnet_examples.json - annotated training data combining SemCor and WordNet examples for supervising the WordNet linker.

If you would like to generate these files yourself from scratch, follow these steps.
WORKDIR=.
python bin/extract_wordnet.py --extract_graph --entity_file $WORKDIR/entities.jsonl --relationship_file $WORKDIR/relations.txt
cd $WORKDIR
wget https://pilehvar.github.io/wic/package/WiC_dataset.zip
unzip WiC_dataset.zip
cd $WORKDIR
wget http://lcl.uniroma1.it/wsdeval/data/WSD_Evaluation_Framework.zip
unzip WSD_Evaluation_Framework.zip
mkdir $WORKDIR/wsd_jsonl
python bin/preprocess_wsd.py --wsd_framework_root $WORKDIR/WSD_Evaluation_Framework --outdir $WORKDIR/wsd_jsonl
cat $WORKDIR/wsd_jsonl/semeval* $WORKDIR/wsd_jsonl/senseval* > $WORKDIR/semeval2007_semeval2013_semeval2015_senseval2_senseval3.json
python bin/extract_wordnet.py --extract_examples_wordnet --entity_file $WORKDIR/entities.jsonl --wic_root_dir $WORKDIR --wordnet_example_file $WORKDIR/wordnet_examples_remove_wic_devtest.json
cat $WORKDIR/wordnet_examples_remove_wic_devtest.json $WORKDIR/wsd_jsonl/semcor.json > $WORKDIR/semcor_and_wordnet_examples.json
python bin/extract_wordnet.py --split_wordnet --relationship_file $WORKDIR/relations.txt --relationship_train_file $WORKDIR/relations_train99.txt --relationship_dev_file $WORKDIR/relations_dev01.txt
allennlp train -s $WORKDIR/wordnet_tucker --include-package kb.kg_embedding --file-friendly-logging training_config/wordnet_tucker.json
python bin/combine_wordnet_embeddings.py --generate_wordnet_synset_vocab --entity_file $WORKDIR/entities.jsonl --vocab_file $WORKDIR/wordnet_synsets_mask_null_vocab.txt
python bin/combine_wordnet_embeddings.py --generate_gensen_embeddings --entity_file $WORKDIR/entities.jsonl --vocab_file $WORKDIR/wordnet_synsets_mask_null_vocab.txt --gensen_file $WORKDIR/gensen_synsets.hdf5
python bin/combine_wordnet_embeddings.py --extract_tucker --tucker_archive_file $WORKDIR/wordnet_tucker/model.tar.gz --vocab_file $WORKDIR/wordnet_synsets_mask_null_vocab.txt --tucker_hdf5_file $WORKDIR/tucker_embeddings.hdf5
python bin/combine_wordnet_embeddings.py --combine_tucker_gensen --tucker_hdf5_file $WORKDIR/tucker_embeddings.hdf5 --gensen_file $WORKDIR/gensen_synsets.hdf5 --all_embeddings_file $WORKDIR/wordnet_synsets_mask_null_vocab_embeddings_tucker_gensen.hdf5
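The `--split_wordnet` command above writes relations_train99.txt and relations_dev01.txt. The file names suggest a 99% / 1% train / dev split, although the text does not state how the split is drawn. As a rough sketch only, assuming a random split over lines, it could look like this. The function name, the seed and the placeholder relation lines are all hypothetical and do not reflect the real format of relations.txt.

```python
import random


def split_lines(lines, dev_fraction=0.01, seed=0):
    """Randomly split lines into (train, dev) with the given dev fraction."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    # Keep at least one dev line whenever there is any data at all.
    n_dev = max(1, int(round(len(shuffled) * dev_fraction))) if shuffled else 0
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder lines; the actual file format is not described here.
relations = [f"synset_{i}\trelation\tsynset_{i + 1}" for i in range(1000)]
train, dev = split_lines(relations)
print(len(train), len(dev))
```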
The linker pretraining step trains the entity linker using only supervised data, while the rest of the network is frozen.
Config files are in training_config/pretraining/knowbert_wiki_linker.jsonnet and training_config/pretraining/knowbert_wordnet_linker.jsonnet.
To train the Wikipedia linker for KnowBert-Wiki run:
allennlp train -s OUTPUT_DIRECTORY --file-friendly-logging --include-package kb.include_all training_config/pretraining/knowbert_wiki_linker.jsonnet
The command is similar for WordNet.
After pre-training the entity linkers from the step above, fine tune BERT.
The pretrained models in our paper were trained on a single GPU with 24GB of RAM. For multiple GPU training, change cuda_device
to a list of device IDs.
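One way to make this change without editing the config file is to pass it through the `--overrides` flag of `allennlp train`. The sketch below builds such an override string and prints the resulting command. It assumes that `cuda_device` sits under the `trainer` key of the config, which should be verified against the jsonnet file; the helper name is illustrative.

```python
import json


def multi_gpu_overrides(device_ids):
    """Build an --overrides JSON string; assumes cuda_device lives under "trainer"."""
    return json.dumps({"trainer": {"cuda_device": list(device_ids)}})


# Print (rather than run) a training command for GPUs 0 and 1.
print(
    "allennlp train -s OUTPUT_DIRECTORY --file-friendly-logging "
    "--include-package kb.include_all "
    "training_config/pretraining/knowbert_wiki.jsonnet "
    f"--overrides '{multi_gpu_overrides([0, 1])}'"
)
```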
Config files are in training_config/pretraining/knowbert_wiki.jsonnet and training_config/pretraining/knowbert_wordnet.jsonnet.
Before training, modify the following keys in the config file (or use the --overrides flag to allennlp train):
"language_modeling"
"model_archive" to point to the model.tar.gz from the previous linker pretraining step.

To train KnowBert-W+W, first train KnowBert-Wiki. Then pretrain the WordNet linker and finally fine tune the entire network.
The config file to pretrain the WordNet linker from KnowBert-Wiki is training_config/pretraining/knowbert_wordnet_wiki_linker.jsonnet, and the config to train KnowBert-W+W is training_config/pretraining/knowbert_wordnet_wiki.jsonnet.