explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

How do I use the NER stanfordnlp annotator? #16

Closed askhogan closed 5 years ago

askhogan commented 5 years ago

Using this wrapper, you'll be able to use the following annotations, computed by your pretrained stanfordnlp model:

- Statistical tokenization (reflected in the Doc and its tokens)
- Lemmatization (token.lemma and token.lemma_)
- Part-of-speech tagging (token.tag, token.tag_, token.pos, token.pos_)
- Dependency parsing (token.dep, token.dep_, token.head)
- Sentence segmentation (doc.sents)
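The paired attribute names reflect spaCy's convention of exposing each annotation twice: `token.pos` is an integer ID (cheap to store and compare) and `token.pos_` is the human-readable string. A toy sketch of that convention — not spaCy's actual implementation, just the idea:

```python
# Illustrative only: spaCy backs string attributes with a shared string store,
# so `token.pos` is an int ID and `token.pos_` is the string view of it.
class StringStore:
    def __init__(self):
        self._to_id = {}
        self._to_str = []

    def add(self, s):
        if s not in self._to_id:
            self._to_id[s] = len(self._to_str)
            self._to_str.append(s)
        return self._to_id[s]

    def __getitem__(self, i):
        return self._to_str[i]

store = StringStore()

class Token:
    def __init__(self, text, pos):
        self.text = text
        self.pos = store.add(pos)   # integer ID

    @property
    def pos_(self):
        return store[self.pos]      # string view of the same annotation

tok = Token("ran", "VERB")
print(tok.pos, tok.pos_)  # prints: 0 VERB
```

Two tokens with the same tag share one ID, which is why spaCy's pipelines compare the int attributes internally and reserve the `_` variants for display.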

Where is Named Entity Recognition? https://stanfordnlp.github.io/CoreNLP/ner.html

Also, spaCy's own website specifically states that state of the art comes with CoreNLP, not spaCy:

https://cl.ly/285a4edaf7a5/Image%202019-06-06%20at%207.32.06%20PM.png

honnibal commented 5 years ago

Where is Named Entity Recognition? https://stanfordnlp.github.io/CoreNLP/ner.html

Well... there are currently no NER predictions from StanfordNLP models, so this package does not provide any NER predictions. You appear to be confusing StanfordNLP and Stanford CoreNLP (not our names). From https://stanfordnlp.github.io/ :

Stanford CoreNLP is our Java toolkit which provides a wide variety of NLP tools.

StanfordNLP is a new Python project which includes a neural NLP pipeline and an interface for working with Stanford CoreNLP in Python.

Eventually the StanfordNLP models will support NER. At that point we'll update this package to provide those NER predictions to spaCy.

The CoreNLP models are quite old, and not really actively maintained. They also require you to run GPL Java software. Stanford's current neural models are the ones we refer to when we say that their package has "the latest state of the art".

Also:

this is a post to say I want to get work done - I want to use a tool like Prodigy - but every time I spend an hour or so of additional dev time, the spaCy models give me junk entities, and then this package shows up within evolution as a potential bridge - yet it lacks the most essential annotator. That's my point.

Come on. Have a word with yourself mate.

askhogan commented 5 years ago

@honnibal thanks for the explanation. I was using StanfordNLP as the Python interface to CoreNLP, so I inferred from your StanfordNLP reference that you were also referencing CoreNLP. I understand now from your point that StanfordNLP has two purposes, and this library is only for one of those purposes.

On your website, as a user I was confused: I saw that you compare CoreNLP, and then below compare StanfordNLP (which I took to be CoreNLP's Python library), which led me to believe you were comparing the same thing - https://cl.ly/bcba0a346b60/Image%202019-06-09%20at%205.26.51%20AM.png


CoreNLP is updated regularly - it was updated less than 24 hours ago -> https://github.com/stanfordnlp/CoreNLP/commits/master

CoreNLP does not do what Prodigy can do with active learning - which is why it would be great if you had a way to bridge this gap. Is there a way to do that?
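One practical bridge, while the wrapper lacks NER: run the external tagger, then convert its character-offset entity spans into token-level BIO tags, a format that spaCy training code and annotation tooling can both consume. A minimal sketch with hypothetical data - naive whitespace tokenization and illustrative spans, not CoreNLP's real output format:

```python
# Hedged sketch: turn (start_char, end_char, label) entity spans from an
# external NER into per-token BIO tags. Function name and data are made up.

def char_spans_to_bio(text, spans):
    """spans: list of (start_char, end_char, label) from the external tagger."""
    # Naive whitespace tokenization, tracking character offsets.
    tokens, offset = [], 0
    for word in text.split():
        start = text.index(word, offset)
        tokens.append((word, start, start + len(word)))
        offset = start + len(word)

    tagged = []
    for word, start, end in tokens:
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:
                # First token of the span gets B-, the rest get I-.
                tag = ("B-" if start == s else "I-") + label
                break
        tagged.append((word, tag))
    return tagged

text = "Acme Corp filed a deed in Travis County"
spans = [(0, 9, "ORG"), (26, 39, "GPE")]
print(char_spans_to_bio(text, spans))
```

A real pipeline would use the wrapper's own tokenization instead of `str.split()` so that the offsets line up with the Doc's tokens, but the offset-alignment idea is the same.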

I am sharing my experience, wherein spaCy 2+ with the included models produces inaccurate results when using the built-in NER. Perhaps my use case of parsing government-related documents is outside of where spaCy does well. But CoreNLP and Stanford NER did far better than spaCy 2+ at accurately categorizing text for Organization, Person, and Dollar amount in government-related documents (e.g. deed records). I am more than happy to share my files and results.

askhogan commented 5 years ago

Updating ticket after analysis:

  1. The StanfordNLP NER does do a better job out of the gate - but upon closer analysis, that is because the StanfordNLP NER annotator uses many more phrase-pattern-matching rules, in places like edu/stanford/nlp/models/kbp/regexner_caseless.tab

  2. When these rules are added to spaCy, spaCy performs better

  3. After spending time with spaCy, it is much more customizable and easier to work with. Very quickly your results will become better than Stanford NER's
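Rule files like regexner_caseless.tab are essentially a caseless phrase gazetteer, and the same idea maps onto spaCy's EntityRuler with phrase patterns. A pure-Python sketch of longest-match gazetteer tagging - the rules below are illustrative, not the contents of Stanford's actual file:

```python
# Hedged sketch of the regexner idea: a caseless phrase list mapped to entity
# labels, applied longest-match-first over a token sequence. Hypothetical rules.

GAZETTEER = {
    ("travis", "county"): "GPE",
    ("acme", "corp"): "ORG",
    ("john",): "PERSON",
}

def tag_with_gazetteer(tokens):
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest phrase starting at position i first.
        for length in range(min(len(tokens) - i, 4), 0, -1):
            phrase = tuple(lowered[i : i + length])
            if phrase in GAZETTEER:
                match = (length, GAZETTEER[phrase])
                break
        if match:
            length, label = match
            tags[i] = "B-" + label
            for j in range(i + 1, i + length):
                tags[j] = "I-" + label
            i += length
        else:
            i += 1
    return tags

print(tag_with_gazetteer("John sold Acme Corp land in Travis County".split()))
```

In spaCy this would be a few EntityRuler patterns added before (or instead of) the statistical NER component, which is roughly the hybrid setup point 2 above describes.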