explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

💫 Multi-task CNN for parser, tagger and NER #1057

Closed · honnibal closed this 7 years ago

honnibal commented 7 years ago

The implementation of the neural network model for spaCy's parser, tagger and NER is now complete. 🎉 There are still a lot of hyper-parameters to tune, efficiency improvements to make, and hacks to unhack — but the main work is done.

The code is on the v2 branch. Currently, it requires Chainer, which may cause installation to fail on machines without a GPU. This will obviously be fixed prior to release.

Preliminary results

Current parser performance on the AnCora Spanish corpus:

spaCy v1: 87.5
spaCy v2: 90.96
ParseySaurus (SyntaxNet): 91.02

Parse speeds are down on CPU compared with the 1.x branch: the neural network model is currently about 4x slower on CPU. With a modest GPU, the v2 model is about as fast as v1 running on about 3 CPU threads. I think we can claw back most of this lost performance and get to around half the linear model's speed on CPU. The plan is to keep focusing on CPU runtime for now: I think this will continue to be the cheapest and most convenient way for people to run spaCy. Of course, GPU training is nice 😊

The parsing model is a blend of recent results. The two main recent inspirations have been the work of Eliyahu Kiperwasser and Yoav Goldberg at Bar-Ilan University [1], and the SyntaxNet team at Google. The foundation of the parser is still the work of Joakim Nivre [5], who introduced the transition-based framework [7], the arc-eager transition system, and the imitation learning objective. There's a short bibliography at the end of the issue.

Outline of the model

The model is implemented using Thinc, our machine learning library. (The parsing model uses Thinc v6.6.0, which was just released.) We first predict context-sensitive vectors for each word in the input:

(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4
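
To make the dataflow concrete, here's a rough numpy sketch of that composition. It is illustrative only, not the actual Thinc implementation; the embedding widths, table sizes and number of maxout pieces are assumptions. Each lexical attribute is hash-embedded, the embeddings are concatenated, mixed down to the token width with a maxout layer, and then passed through four convolutional layers that each look one word to either side:

import numpy as np

rng = np.random.default_rng(0)
token_width = 128
embed_width = 64   # assumed width for each attribute embedding

def hash_embed(ids, table):
    # Embed an integer attribute (lower, prefix, suffix, shape) by hashing into a fixed table.
    return table[ids % table.shape[0]]

def maxout(X, W, b):
    # W: (pieces, n_out, n_in); take the max over the pieces dimension.
    return np.einsum("poi,ni->npo", W, X).max(axis=1) + b

def conv(X, W, b):
    # Concatenate each token with one neighbour either side, then maxout back to token_width.
    pad = np.zeros((1, X.shape[1]))
    padded = np.vstack([pad, X, pad])
    windows = np.hstack([padded[:-2], padded[1:-1], padded[2:]])
    return maxout(windows, W, b)

n_tokens = 10   # toy sentence length
attrs = {name: rng.integers(0, 10_000, n_tokens) for name in ("lower", "prefix", "suffix", "shape")}
tables = {name: rng.normal(size=(5000, embed_width)) for name in attrs}

# (embed_lower | embed_prefix | embed_suffix | embed_shape)
embedded = np.hstack([hash_embed(attrs[name], tables[name]) for name in attrs])

# >> Maxout(token_width)
W0, b0 = rng.normal(size=(3, token_width, embedded.shape[1])), np.zeros(token_width)
tokvecs = maxout(embedded, W0, b0)

# >> convolution ** 4
for _ in range(4):
    Wc, bc = rng.normal(size=(3, token_width, 3 * token_width)), np.zeros(token_width)
    tokvecs = conv(tokvecs, Wc, bc)

print(tokvecs.shape)   # (10, 128): one context-sensitive vector per token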

This convolutional layer is shared between the tagger, parser and NER, and will also be shared by the future neural lemmatizer. Because the parser shares these layers with the tagger, the parser does not require tag features. I got this trick from David Weiss's "Stack-propagation" paper [2].

To boost the representation, the tagger actually predicts a "super tag" with POS, morphology and dependency label. This part is novel – and it helps quite a lot, especially for languages such as Spanish where the POS task is by itself too easy. (Edit: Actually, not so novel -- and I'd actually read this paper, and even discussed it with Yoav! So easy to lose track...)

The tagger predicts these supertags by adding a softmax layer onto the convolutional layer --- so, we're teaching the convolutional layer to give us a representation that's one affine transform from this informative lexical information. This is obviously good for the parser (which backprops to the convolutions too).
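
For intuition, here is a minimal sketch of such a supertagging head; the supertag strings and sizes below are made up for illustration. The head is just an affine transform plus softmax over the shared token vectors:

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical supertag inventory: each class fuses POS, morphology and dependency label.
supertags = ["NOUN|Number=Sing|nsubj", "VERB|Tense=Pres|ROOT", "DET|Definite=Def|det"]

token_width = 128
rng = np.random.default_rng(0)
W, b = rng.normal(size=(token_width, len(supertags))), np.zeros(len(supertags))

tokvecs = rng.normal(size=(5, token_width))    # stand-in for the shared CNN output for five tokens
probs = softmax(tokvecs @ W + b)               # the supertags are one affine transform away
print(supertags[int(probs[0].argmax())])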

The parser model makes a state vector by concatenating the vector representations of its context tokens: words on the stack and at the front of the buffer in the current parse state, plus some of their children.
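
Concretely, building the state vector is just a gather-and-concatenate over the CNN's output (a sketch; the context positions below are placeholders):

import numpy as np

T = 128
tokvecs = np.random.default_rng(0).normal(size=(20, T))   # CNN output for a 20-token sentence

# Hypothetical snapshot of 13 context positions (stack/buffer tokens and their children);
# -1 marks a missing token and maps to a zero padding vector.
context = [3, 2, 1, 4, 5, 6, 0, -1, 2, -1, 5, -1, -1]
padded = np.vstack([tokvecs, np.zeros((1, T))])           # row -1 is the padding vector
state_vector = padded[context].ravel()                    # shape (13 * T,) == (1664,)
print(state_vector.shape)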

This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately, there's a way to structure the computation to save some expense (and make it more GPU friendly).

The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks with a non-monotonic transition [4]). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden weights for each positional feature with respect to the words in the batch. (Note that our token vectors come from the CNN — so we can't play this trick over the vocabulary. That's how Stanford's NN parser [3] works — and why its model is so big.)
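
Here's a small numpy sketch of that pre-computation trick, with batching omitted and the shapes and 13-slot layout as illustrative assumptions; the naive and pre-computed paths give the same hidden pre-activations:

import numpy as np

rng = np.random.default_rng(0)
N, T, H = 40, 128, 64              # tokens in the sentence, token width, hidden width
W = rng.normal(size=(13, T, H))    # one weight slice per context-token slot
tokvecs = rng.normal(size=(N, T))

# Naive: for every parser state, concatenate 13 token vectors and multiply by (13*T, H).
def naive_hidden(context):                      # context: 13 token indices for one state
    state_vec = tokvecs[context].ravel()        # (13 * T,)
    return state_vec @ W.reshape(13 * T, H)     # (H,)

# Pre-computed: one (N, T) @ (T, 13*H) multiplication up front ...
precomputed = tokvecs @ W.transpose(1, 0, 2).reshape(T, 13 * H)   # (N, 13 * H)
precomputed = precomputed.reshape(N, 13, H)

# ... then each state only needs to sum 13 rows, with no matrix multiply at parse time.
def fast_hidden(context):
    return precomputed[context, np.arange(13)].sum(axis=0)        # (H,)

context = rng.integers(0, N, size=13)
print(np.allclose(naive_hidden(context), fast_hidden(context)))   # True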

This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity. The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier. We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle in CUDA to train.

Currently the parser's loss function is a multilabel log loss [6], since the dynamic oracle allows multiple transitions to have zero cost. With Z the sum of exp(score) over all classes and gZ the sum of exp(score) over the zero-cost (gold) classes, the loss is -log(gZ / Z), which gives each zero-cost class the gradient:

(exp(score) / Z) - (exp(score) / gZ)

I'm very interested in regressing on the cost directly, but so far this isn't working well. I've read that L2 losses generally don't work well in neural networks, which is disappointing. Maybe I'm missing some tricks here?
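
In code, the per-class gradient implied by that loss looks roughly like this (a sketch with made-up scores and costs):

import numpy as np

scores = np.array([2.0, 0.5, 1.5, -1.0])     # parser scores for each candidate transition
costs  = np.array([0.0, 1.0, 0.0, 2.0])      # dynamic-oracle costs; 0 marks a gold transition

exp_scores = np.exp(scores - scores.max())   # shift for numerical stability
Z  = exp_scores.sum()                        # partition over all transitions
gZ = exp_scores[costs == 0].sum()            # partition over the zero-cost (gold) transitions

# Gradient of the loss w.r.t. each score: positive for costly transitions (scores pushed down),
# negative for zero-cost transitions (scores pushed up).
d_scores = exp_scores / Z - (costs == 0) * exp_scores / gZ
print(d_scores)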

Machinery is in place for beam search, which has been working well for the linear model. Beam search should benefit greatly from the pre-computation trick. However, I do like parsing the entire input without requiring sentence boundary detection as a pre-process, and that's tricky to do correctly with the beam. The current beam implementation introduces quadratic time complexity for long sequences, because it copies state data that's O(N) in the length of the sentence.

Bibliography

[1] Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Eliyahu Kiperwasser, Yoav Goldberg (2016)

[2] Stack-propagation: Improved Representation Learning for Syntax. Yuan Zhang, David Weiss (2016)

[3] A Fast and Accurate Dependency Parser using Neural Networks. Danqi Chen, Christopher D. Manning (2014)

[4] An Improved Non-monotonic Transition System for Dependency Parsing. Matthew Honnibal, Mark Johnson (2015)

[5] A Dynamic Oracle for Arc-Eager Dependency Parsing. Yoav Goldberg, Joakim Nivre (2012)

[6] Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. Stefan Riezler et al. (2002)

[7] Parsing English in 500 Lines of Python. Matthew Honnibal (2013)


anna-hope commented 7 years ago

Thanks for writing this up.

Is this parsing model currently available on the develop branch?

ines commented 7 years ago

See the v2.0.0 alpha release notes and #1105 🎉

mollerhoj commented 7 years ago

@honnibal: Please correct me if I'm wrong, but the shared CNN in spaCy 2.0 seems to have a big drawback: the training data for POS, NER and dependencies must come from the same sentences.

For languages without many resources, it's often the case that the training data for POS tags, NER and dependencies comes from several different sources. (Usually, corpora with dependency annotations are quite small, whereas corpora with POS tags are large.)

honnibal commented 7 years ago

@mollerhoj The newest release fixes this by adding an update_shared flag and giving each model a private copy of the CNN as well. See here for further discussion: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting

mollerhoj commented 7 years ago

@honnibal Yay! A new blog post, can't wait to read it! It's much better to read your posts than try to digest scientific articles. Keep up the good work, I'm a big fan!

StrawberryDream commented 6 years ago

@honnibal Hi, I installed version 2.0.7 and also synced the latest code on the master branch, but I did not find the "update_shared" flag in the update() function in pipeline.pyx. I wonder if this feature was implemented some other way. How can I tune the POS tagger and make the dependency parser use the tuned POS tagger? Thank you very much for your help!

muzaluisa commented 6 years ago

May I ask which paper you used for the NER implementation? Or where can I find the NER implementation details? This issue covers the dependency parser, but not NER specifically. Thanks

l2edzl3oy commented 6 years ago

@muzaluisa see video here and blog post here, alongside all the above information in this issue thread. Those were useful references for me and I hope they are for you too :)

@honnibal @ines Token currently has lemma and norm attributes. Based on my understanding, norm is used as a feature for the model, and I was wondering how lemma is used in the model (if at all). I'm trying to wrap my head around the difference between lemma and norm, as both seem to be the "base" form of the original text for the token (i.e. the orth attribute) and hence should have the same value. I was wondering if this distinction was made because of the future neural lemmatizer - am I right to assume so? Thanks!

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.