this is a proof of concept for named entity recognition (NER) using various CRF solutions. the demos here use all-lower-cased text in order to simulate NER on text where case information is not available (e.g. automatic speech recognition output)
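For illustration, the caseless condition can be simulated by lower-casing every token before feature extraction or indexing (a minimal sketch; `sents` is a hypothetical example, not data from this repo):

```python
# Simulate caseless (e.g. ASR) text by lower-casing every token
# before any feature extraction or vocabulary indexing.
sents = [["George", "Washington", "visited", "Berlin"]]  # hypothetical input
caseless = [[tok.lower() for tok in sent] for sent in sents]
print(caseless)  # [['george', 'washington', 'visited', 'berlin']]
```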
June 08 2018 update:
- pycrfsuite report for both models
- pycrfsuite code

requirements:
- gensim
- keras
- keras-contrib
- tensorflow
- numpy
- pandas
- python-crfsuite
pycrfsuite:
- run data-preprocessing.ipynb to generate formatted model data
- run pycrfsuite-training.ipynb to fit the model
- see results/pyCRF-sample.csv for sample output

keras:
- run data-preprocessing.ipynb to generate formatted model data
- run keras_training.ipynb to train and save the model
- run keras-decoding.ipynb to load the saved model and decode test sentences
- see results/keras-biLSTM-CRF_sample.csv for sample output

trained on the CoNLL-2002 English NER dataset:
https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus
NB: convert the csv to utf-8 first; a converted csv is in the repository
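The re-encoding step can be done with pandas, along these lines (a sketch: the sample filenames are illustrative, and the source encoding is assumed to be latin-1; adjust `encoding` to match the raw file):

```python
import pandas as pd

# Write a tiny latin-1 csv to stand in for the raw download, then
# re-encode it as utf-8; the same recipe applies to the real dataset csv.
with open("sample_latin1.csv", "w", encoding="latin-1") as f:
    f.write("Word,Tag\nMünchen,B-geo\n")

df = pd.read_csv("sample_latin1.csv", encoding="latin-1")
df.to_csv("sample_utf8.csv", index=False, encoding="utf-8")

with open("sample_utf8.csv", encoding="utf-8") as f:
    print(f.read())
```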
see: preprocessing.ipynb
see: pycrfsuite-training.ipynb
model inputs: hand-engineered word and pos-tag features
model output: named entity tag sequences
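A minimal sketch of what such hand-engineered token features can look like, in the dict style pycrfsuite accepts (the exact feature set in pycrfsuite-training.ipynb may differ; `word2features` and the example sentence are illustrative):

```python
# Per-token feature dict over (word, pos) pairs, with a one-token
# context window and begin/end-of-sentence markers.
def word2features(sent, i):
    word, pos = sent[i]
    feats = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],   # suffix
        "postag": pos,
    }
    if i > 0:
        pw, ppos = sent[i - 1]
        feats.update({"-1:word.lower()": pw.lower(), "-1:postag": ppos})
    else:
        feats["BOS"] = True       # beginning of sentence
    if i < len(sent) - 1:
        nw, npos = sent[i + 1]
        feats.update({"+1:word.lower()": nw.lower(), "+1:postag": npos})
    else:
        feats["EOS"] = True       # end of sentence
    return feats

sent = [("george", "NNP"), ("visited", "VBD"), ("berlin", "NNP")]
X = [word2features(sent, i) for i in range(len(sent))]
```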
see: keras_training.ipynb
model inputs: integer-indexed word and pos-tag sequences (padded)
model output: integer-indexed named entity tag sequences (padded)
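Integer-indexing and padding can be sketched as follows (the vocabulary, `PAD` index, and `MAXLEN` are illustrative; the notebooks use keras's `pad_sequences` for the padding step, inlined here to stay dependency-free):

```python
# Turn token sequences into fixed-length integer-index sequences,
# post-padded with a reserved PAD index.
PAD, MAXLEN = 0, 5  # illustrative values

sents = [["george", "visited", "berlin"], ["the", "un"]]
vocab = {w: i for i, w in enumerate(sorted({w for s in sents for w in s}), start=1)}

def encode(sent):
    idx = [vocab[w] for w in sent]
    return (idx + [PAD] * MAXLEN)[:MAXLEN]  # pad (or truncate) to MAXLEN

X = [encode(s) for s in sents]
```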
see: keras-decoding.ipynb
see: keras-decoding.ipynb for code, results/XXXX-sample.csv for a sample decode
this file decodes test set results into human-readable format.
adjust the number of outputs to see by changing the 500 up or down in the following line:
for sent_idx in range(len(X_test_sents[:500])):
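Conceptually, the decode step maps padded index predictions back to tag strings and drops the padding (a sketch; `idx2tag` and `y_pred` here are illustrative, not the notebook's actual mapping):

```python
# Map predicted tag indices back to tag strings, skipping PAD positions.
idx2tag = {0: "PAD", 1: "O", 2: "B-geo", 3: "I-geo"}

y_pred = [[2, 3, 1, 0, 0]]  # one padded predicted sequence
decoded = [[idx2tag[i] for i in seq if idx2tag[i] != "PAD"] for seq in y_pred]
print(decoded)  # [['B-geo', 'I-geo', 'O']]
```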
per-tag results on the withheld test set
py-crfsuite
precision recall f1-score support
B-art 0.31 0.06 0.10 69
I-art 0.00 0.00 0.00 54
B-eve 0.52 0.35 0.42 46
I-eve 0.35 0.22 0.27 36
B-geo 0.85 0.90 0.87 5629
I-geo 0.81 0.74 0.77 1120
B-gpe 0.94 0.92 0.93 2316
I-gpe 0.89 0.65 0.76 26
B-nat 0.73 0.46 0.56 24
I-nat 0.60 0.60 0.60 5
B-org 0.78 0.69 0.73 2984
I-org 0.77 0.76 0.76 2377
B-per 0.81 0.81 0.81 2424
I-per 0.81 0.90 0.85 2493
B-tim 0.92 0.83 0.87 2989
I-tim 0.82 0.70 0.75 1017
avg / total 0.83 0.82 0.82 23609
keras biLSTM-CRF
precision recall f1-score support
B-art 0.26 0.14 0.18 66
I-art 0.17 0.07 0.10 54
B-eve 0.34 0.25 0.29 44
I-eve 0.20 0.21 0.20 34
B-geo 0.87 0.90 0.89 5436
I-geo 0.79 0.83 0.81 1065
B-gpe 0.96 0.95 0.95 2284
I-gpe 0.71 0.60 0.65 25
B-nat 0.58 0.65 0.61 23
I-nat 1.00 0.40 0.57 5
B-org 0.80 0.75 0.77 2897
I-org 0.84 0.77 0.81 2286
B-per 0.84 0.85 0.84 2396
I-per 0.84 0.90 0.87 2449
B-tim 0.90 0.89 0.90 2891
I-tim 0.84 0.75 0.80 957
avg / total 0.85 0.85 0.85 22912