PipelineAI / pipeline

PipelineAI
https://generativeaionaws.com
Apache License 2.0
4.17k stars 972 forks

[Demo] TensorFlow + SyntaxNet (Parsey McParseface) #89

Closed cfregly closed 7 years ago

cfregly commented 8 years ago

Link to blog post: https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html

Link to TF docs: https://www.tensorflow.org/versions/r0.9/tutorials/syntaxnet/index.html#syntaxnet

Link to dataset: https://github.com/tensorflow/models/tree/master/syntaxnet

Link to our wiki: https://github.com/fluxcapacitor/pipeline/wiki/TensorFlow-Support#todo--syntaxnet

andyzeli commented 8 years ago

Interesting tidbit from the author of a nice library called spaCy:

As the author of the spaCy.io parser, I’ve been looking forward to the SyntaxNet release for some time. Our mission is to get the latest and greatest NLP technologies out into real products, so they can be put to work. There’s no doubt that Google’s release of SyntaxNet helps us do that.

Probably the most obvious difference you’ll see between spaCy and SyntaxNet is one of intent, which drives pretty deep differences in design. SyntaxNet is a library for researching NLP models, while spaCy is a library for applying NLP models in production.

spaCy is written with the assumption that 99% of the time you’ll be running the models. When I was in research, my software assumed that 99% of the time I would be training the models. I think SyntaxNet is built with similar assumptions.

SyntaxNet’s parser and POS tagging models are more accurate, load faster, and consume much less memory than spaCy. spaCy’s models are faster, and come with built-in sentence segmentation, which SyntaxNet lacks.

I suspect that many users might find spaCy actually more accurate for their use case overall, due to segmentation issues. SyntaxNet is trained to expect input with perfect segmentation.[1] Real-world text is not so well behaved, particularly text from social media, where punctuation is bad and new lines are used inconsistently. spaCy has a trick here that SyntaxNet lacks: its parser deals with this by parsing the whole document at once. You don't declare the sentence boundaries up front; instead, they're derived from the syntactic structure. The benefits of this will depend on your specific use case. I've reached out to the SyntaxNet team to see whether we can get a whole-document evaluation of SyntaxNet done, starting from the raw text.
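The idea of deriving sentence boundaries from syntactic structure, rather than declaring them up front, can be sketched in a few lines. This is a minimal illustration of the general technique, not spaCy's actual implementation; the tokens and head indices below are invented for the example.

```python
# Sketch: recover sentence boundaries from a dependency parse.
# Each token has a head index; a token whose head is itself is a root.
# Tokens sharing a root belong to the same sentence, so boundaries
# fall out of the tree with no punctuation cues needed.

def sentences_from_heads(tokens, heads):
    """Group tokens into sentences by the root each token attaches to."""
    def root_of(i):
        seen = set()
        while heads[i] != i and i not in seen:
            seen.add(i)
            i = heads[i]
        return i

    sentences = {}
    for i, tok in enumerate(tokens):
        sentences.setdefault(root_of(i), []).append(tok)
    return [sentences[r] for r in sorted(sentences)]

tokens = ["i", "like", "it", "thanks", "so", "much"]
heads = [1, 1, 1, 3, 5, 3]  # invented parse: two roots (1 and 3)
print(sentences_from_heads(tokens, heads))
# → [['i', 'like', 'it'], ['thanks', 'so', 'much']]
```

Note how the ill-punctuated string splits into two sentences purely because the parser attached the tokens to two separate roots.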

Finally, SyntaxNet gives you almost nothing to help you actually put the annotations it produces to work. spaCy’s API is very productive in this respect. We’ve taken care to design things in a way that makes it easy to use all the different levels of representation together. You can merge phrases into single tokens, get the word vector of the head of a particular named entity, etc.
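The "merge phrases into single tokens" idea can be shown with a small self-contained sketch. In spaCy this operates on `Doc` and `Span` objects; plain lists are used here so the example runs on its own, and the spans are invented for illustration.

```python
# Sketch: collapse token spans (e.g. named entities or noun phrases)
# into single tokens, so downstream code treats "New York City" as one unit.

def merge_spans(tokens, spans):
    """spans: list of (start, end) half-open index pairs to collapse."""
    starts = {s: e for s, e in spans}
    merged, i = [], 0
    while i < len(tokens):
        if i in starts:
            # Join the whole span into a single token and skip past it.
            merged.append(" ".join(tokens[i:starts[i]]))
            i = starts[i]
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_spans(["I", "flew", "to", "New", "York", "City"], [(3, 6)]))
# → ['I', 'flew', 'to', 'New York City']
```

Once the entity is a single token, asking for its head or its word vector becomes a single lookup rather than a span-level aggregation.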

The best of both worlds will be to load the analyses of SyntaxNet into spaCy for use, possibly with spaCy used as a sentence segmentation pre-process. This might seem a little odd, since it will mean that you’re parsing the sentence with spaCy just to get the sentence boundaries, and then discarding the parse tree to use the one from SyntaxNet. However, spaCy’s parser is about 20x faster than SyntaxNet’s, so the cost of this pre-process should be negligible.
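The proposed combination, a fast parser used only for segmentation feeding a slower, more accurate parser, can be sketched as a two-stage pipeline. `segment` and `parse_accurately` below are stand-ins for spaCy and SyntaxNet respectively, not their real APIs.

```python
# Sketch of the hybrid pipeline: cheap segmentation pass, then an
# expensive per-sentence parse. The stand-in functions are hypothetical.

def segment(text):
    # Stand-in for the fast parser's segmenter: naive "." splitting.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def parse_accurately(sentence):
    # Stand-in for the accurate parser: returns a dummy analysis.
    return {"sentence": sentence, "tree": "(ROOT ...)"}

def pipeline(text):
    # Segment once, then parse each sentence with the accurate model.
    return [parse_accurately(s) for s in segment(text)]

out = pipeline("spaCy is fast. SyntaxNet is accurate.")
print([p["sentence"] for p in out])
# → ['spaCy is fast.', 'SyntaxNet is accurate.']
```

The design point in the quote is that if the segmentation pass is roughly 20x faster than the accurate parse, stage one adds only about 5% overhead, so discarding its parse trees costs little.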

spaCy’s parsing accuracy will catch up to SyntaxNet’s somewhat, once we finish our neural network model. At the moment we’re at about 92.4 on the WSJ evaluation, while they’re at about 94.5. I think we can hit about 94 while staying very fast. There’s zero doubt that researchers from Google will continue to publish better models, as will researchers at other private and public labs all over the world.

[1] Actually, SyntaxNet is trained on text from two sources: gold-standard input from a treebank, and a "silver standard" produced by voting between a panel of 3 state-of-the-art parsers, selected for diversity. The analyses produced by this panel will include segmentation errors. However, none of the parsers in the panel have seen segmentation errors during training either! This means the best analysis they produce for sentences with segmentation errors might still be pretty bad.

samjabrahams commented 8 years ago

https://research.googleblog.com/2016/08/meet-parseys-cousins-syntax-for-40.html

samjabrahams commented 8 years ago

https://github.com/tensorflow/models/blob/master/syntaxnet/universal.md

cfregly commented 8 years ago

https://spacy.io/blog/syntaxnet-in-context

cfregly commented 7 years ago

cute demo, not critical