aatkinson / deep-named-entity-recognition

Use RNNs to identify entities in news queries

Sequence Labelling (seq2seq) Model - News Tags

Matches words, embedded with word2vec, to news item tags in sequence.

By Adam Atkinson.

Usage

How to Test

Query Loop

$ python ner_test.py

Example execution:

$ python ner_test.py
Type a query (type "exit" to exit):
news about Obama

news    B-NEWSTYPE
about   O
Obama   B-KEYWORDS

Tagged News Data File (training data)

$ python my_test_script.py [# samples to read]

Example execution:

$ python my_test_script.py

...

Bad Prediction!
Words: ['home', 'trending', 'news']
Tags : ['B-SECTION', 'B-NEWSTYPE', 'I-NEWSTYPE']
Preds: ['B-KEYWORDS', 'B-NEWSTYPE', 'I-NEWSTYPE']

...

~~~ Summary ~~~
# samples read = 1000
Correctly classified samples = 0.9740
Correctly classified frames = 0.9962
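
As a point of reference, here is a minimal sketch of how the two summary metrics could be computed, assuming a "frame" is a single word/tag pair and a "sample" is an entire query; these definitions are my reading of the output above, not code taken from my_test_script.py:

    def accuracy_summary(true_tags, pred_tags):
        """true_tags / pred_tags: lists of tag sequences, one sequence per sample (query)."""
        frame_hits, frame_total, sample_hits = 0, 0, 0
        for truth, pred in zip(true_tags, pred_tags):
            matches = [t == p for t, p in zip(truth, pred)]
            frame_hits += sum(matches)
            frame_total += len(matches)
            sample_hits += all(matches)  # a sample counts only if every tag in it is right
        return {
            "samples": sample_hits / float(len(true_tags)),  # per-query accuracy
            "frames": frame_hits / float(frame_total),       # per-word accuracy
        }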

How to Train Your Own

$ python ner_train.py your_model.hd5 [# BiLSTM layers] [# epochs]

The word-vector and tagged-news file paths, as well as the model parameters, are hardcoded.

Program and Script Structure

data_util.py    // Reads, parses, and embeds data so it can be used by the model.
                // Defines DataUtil class which owns parsed data and utility functions.

data_test.py    // Tests data_util.py

ner_model.py    // Implements the model: definition, training, testing, prediction.
                // Defines NERModel class which owns model, parameters, and train/test functions.

ner_test.py     // Reads sequences of words from the command line and prints the predicted tags.

ner_train.py    // Trains a new model.

my_test_script.py   // Script that reads news_tagged_data.txt and runs against the model.

model_blstm_150_ep50.h5     // Single layer BiLSTM model
model_blstm_150_150_ep50.h5 // Double layer BiLSTM model

Environment

Software Requirements

Python 2.7.6
Keras==1.0.8
numpy==1.11.1
pandas==0.18.1
scipy==0.18.0
Theano==0.8.2

OS

Linux adama-ideapad 3.19.0-68-generic #76~14.04.1-Ubuntu SMP Fri Aug 12 11:46:25 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Hardware

Model:  Lenovo IdeaPad U430p Notebook
CPU:    Intel(R) Core(TM) i5-4210U CPU @ 1.70 GHz, (32KB/256KB/3MB) L1/L2/L3 cache
RAM:    8 GB (8GiB SODIMM DDR3 Synchronous @ 1600 MHz)
GPU:    None 

Performance

One BiLSTM Layer:

Epoch    Training Accuracy    Test Accuracy
10       0.9513               0.9567
25       0.9903               0.9890
50       0.9988               0.9935

Two BiLSTM Layers:

Epoch    Training Accuracy    Test Accuracy
10       0.9420               0.9556
25       0.9924               0.9887
50       0.9992               0.9929

The 2-layer BiLSTM model seems to overfit very slightly more than the 1-layer model does, but I use the 2-layer one by default because stacked RNNs are cool :)

Methodology

Overview

Network: 150-unit BiLSTM with dropout -> 150-unit BiLSTM with dropout -> Softmax // All stacked

Loss: categorical cross-entropy

Optimizer: Adam
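
As a rough sketch, the stacked architecture described above could be defined like this. It is written against a recent Keras API for readability; the repo targets Keras 1.0.8 on Theano, so the actual ner_model.py almost certainly differs in detail, and the dropout rate and tag count used here are assumptions:

    from keras.models import Sequential
    from keras.layers import Bidirectional, LSTM, Dropout, TimeDistributed, Dense

    def build_model(max_seq_len, embed_dim=300, n_tags=10, units=150, dropout=0.5):
        model = Sequential()
        # First stacked BiLSTM: consumes (max_seq_len, 300) word2vec sequences.
        model.add(Bidirectional(LSTM(units, return_sequences=True),
                                input_shape=(max_seq_len, embed_dim)))
        model.add(Dropout(dropout))
        # Second stacked BiLSTM, still emitting one output per timestep.
        model.add(Bidirectional(LSTM(units, return_sequences=True)))
        model.add(Dropout(dropout))
        # Per-timestep softmax over the tag classes.
        model.add(TimeDistributed(Dense(n_tags, activation='softmax')))
        model.compile(loss='categorical_crossentropy', optimizer='adam',
                      metrics=['accuracy'])
        return model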

Preprocessing, Features, and Data Format

  1. Encode the words (X)
     a. Assign each word a 300-length word2vec representation, from the provided file.
     b. If no vector exists for a word, sample a 300-length vector from a multivariate normal distribution and normalize it.

  2. Encode the tags (y)
     a. Read through the tagged news data to get the number of tags (classes).
     b. Add a 'nil' class for unknown data.
     c. Assign each class an id.
     d. Assign each class a 1-hot vector, length = # classes, where the tag id is the index of the 1.

  3. Assemble the data (see the sketch after this list)
     a. Read the tagged data from the provided file.
     b. Determine the maximum sequence length out of all the data.
     c. For each (sentence, tags) example:
        i.   Map each word in the sentence to a vector from step 1.
        ii.  Map each tag to a one-hot vector from step 2.
        iii. Pad the sentence and tag sequences to the maximum sequence length.
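
A compressed sketch of steps 1-3, assuming w2v is a dict-like mapping from word to 300-length vector (e.g. loaded with gensim) and tag_to_id is a tag-to-index dictionary; the function names are illustrative placeholders, not the repo's actual helpers in data_util.py:

    import numpy as np

    EMBED_DIM = 300

    def embed_word(word, w2v, rng=np.random):
        # Step 1a: look up the 300-length word2vec vector.
        if word in w2v:
            return w2v[word]
        # Step 1b: unknown word -> sample from a multivariate normal and normalize.
        vec = rng.multivariate_normal(np.zeros(EMBED_DIM), np.eye(EMBED_DIM))
        return vec / np.linalg.norm(vec)

    def encode_example(words, tags, w2v, tag_to_id, max_len):
        n_tags = len(tag_to_id)
        X = np.zeros((max_len, EMBED_DIM))
        y = np.zeros((max_len, n_tags))
        for i, (word, tag) in enumerate(zip(words, tags)):
            X[i] = embed_word(word, w2v)                      # step 3.c.i
            y[i, tag_to_id.get(tag, tag_to_id['nil'])] = 1.0  # steps 2d / 3.c.ii
        # Positions past len(words) stay zero: step 3.c.iii, padding to max_len.
        return X, y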

NOTE On Prediction

Unknown words encountered during prediction are assigned a unit-length, 300-dimensional vector sampled from a multivariate normal distribution, the same fallback used during training.
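
To illustrate, here is a hedged sketch of how a single query might be tagged at prediction time, reusing embed_word and EMBED_DIM from the preprocessing sketch above; these are illustrative helpers, not the actual ner_test.py code:

    import numpy as np

    def tag_query(query, model, w2v, id_to_tag, max_len):
        words = query.split()
        X = np.zeros((1, max_len, EMBED_DIM))
        for i, word in enumerate(words):
            # Unknown words get the same normalized random-normal fallback as in training.
            X[0, i] = embed_word(word, w2v)
        probs = model.predict(X)[0]                  # (max_len, n_tags) softmax outputs
        pred_ids = probs.argmax(axis=-1)[:len(words)]
        return list(zip(words, (id_to_tag[i] for i in pred_ids)))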

Model Details and Design Choices

Things I Could Work On (TODO)

Notes

References