spaCy >= 2.0 support

sam-writer commented 4 years ago

Hi Chris, thanks for the big 2.0 updates!

This is regarding the following section of the README

Note: ERRANT does not support spaCy 2 at this time. spaCy 2 POS tags are slightly different from spaCy 1 POS tags and so ERRANT rules, which were designed for spaCy 1, may not always work with spaCy 2.

Since Python can't load multiple versions of the same library in a single project, and we need features that were introduced after spaCy 2.0, we currently have to keep ERRANT isolated in a separate service that we talk to over HTTP. This is not ideal. Since ERRANT now supports passing in a spaCy nlp object, it seems like adding support for spaCy >= 2.0 would not be too difficult.

Specifically, I think we could check nlp._meta['spacy_version']. If the spaCy version is below 2.0, nlp._meta doesn't exist; from 2.0 onwards, it gives us the exact spaCy version. For the current purpose, just testing is_spacy_2_or_above = bool(getattr(nlp, "_meta", False)) should be enough. The quickest fix would then be to map the 2.0 tags to 1.9 tags if is_spacy_2_or_above.
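A minimal sketch of that check, assuming (as described above) that nlp._meta is absent before spaCy 2.0. The helper names spacy_major_version and is_spacy_2_or_above are hypothetical, and stand-in objects replace real pipelines so the snippet runs without spaCy installed. Because the stored spacy_version value may be a constraint such as ">=2.2.2" rather than an exact version, the sketch only extracts the leading major number:

```python
import re
from types import SimpleNamespace


def spacy_major_version(nlp):
    """Best-effort guess at the spaCy major version behind an nlp object."""
    meta = getattr(nlp, "_meta", None)
    if meta is None:
        # Assumption taken from the thread: spaCy 1.x pipelines have no _meta.
        return 1
    # The value may look like "2.2.2" or ">=2.2.2", so take the first digit run.
    match = re.search(r"\d+", str(meta.get("spacy_version", "")))
    # If _meta exists but carries no version, assume 2 (the first version with _meta).
    return int(match.group()) if match else 2


def is_spacy_2_or_above(nlp):
    return spacy_major_version(nlp) >= 2


# Stand-ins for real pipelines (hypothetical, for illustration only).
old_nlp = SimpleNamespace()
new_nlp = SimpleNamespace(_meta={"spacy_version": ">=2.2.2"})
print(is_spacy_2_or_above(old_nlp), is_spacy_2_or_above(new_nlp))
```

Unlike the bool(getattr(...)) form, this treats an empty _meta dict as spaCy 2 or above, which may or may not be what is wanted.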

Is this acceptable? If not, is there some other path to supporting spaCy 2.0+? Thank you!

EDIT: We are happy to work on this; we'd just like to find an approach that you would approve of.

chrisjbryant commented 4 years ago

Hey Sam,

Yes, spaCy 2 support is definitely on the to-do list. I mainly wanted the first pip version to be compatible with the BEA shared task, and newer versions of spaCy will change the results slightly.

Some good news: spaCy finally updated their English tag map to the same one that I use, so as long as you use spaCy >= 2.2.2, rule compatibility shouldn't be a problem. I'm in the process of testing ERRANT with this version of spaCy too, so hopefully ERRANT 2.1 will come out soon!

chrisjbryant commented 4 years ago

Quick update:

I tried using ERRANT with the latest version of spaCy (2.2), and the only thing that broke is a call to an old lemmatiser in the classifier. For a quick fix, you can change the same_lemma function to:

def same_lemma(o_tok, c_tok):
    # Compare spaCy's integer lemma IDs directly instead of calling the old lemmatiser.
    return o_tok.lemma == c_tok.lemma

Otherwise, it looked as if annotation performance decreased by about 1% and processing took about 3 times longer. I'll need to debug the accuracy loss (and have some ideas already), but there's not really anything I can do about the speed loss...
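To put a number on a slowdown like this, one rough approach is to time the same batch of sentences through each setup and compare the results. The sketch below is not part of ERRANT: time_pipeline and slowdown are hypothetical helpers, and process stands for whatever callable handles a single text (for example, a loaded nlp object).

```python
import time


def time_pipeline(process, texts, repeats=3):
    """Return the best wall-clock time (seconds) for running process over texts."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            process(text)
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(old_seconds, new_seconds):
    """How many times longer the new setup takes than the old one."""
    return new_seconds / old_seconds


# Toy stand-in for a real pipeline, for illustration only.
sentences = ["This are a sentence .", "This is a sentence ."] * 100
elapsed = time_pipeline(str.split, sentences)
print(f"best of 3 runs: {elapsed:.6f}s")
```

Taking the best of several runs reduces noise from other processes; each spaCy version would need to be timed in its own environment, since both cannot be installed side by side.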

sam-writer commented 4 years ago

Thanks for the update.

The performance thing is interesting, since spaCy 2.0 was supposed to be faster... In both cases, is this using en_core_web_sm?

chrisjbryant commented 4 years ago

Yes, that's using en_core_web_sm in both cases.

From what I've read, it was never supposed to be faster, just slightly more accurate and memory efficient. Check the "Model Comparison" table on this page. It shows a large speed drop for a relatively small performance gain. It's true that the parser and NER components are >5% more accurate, but ERRANT mainly relies on the POS tagger, so the ~0.5% POS improvement isn't really significant.

There's also a long issue thread about it here from when it went from v1 to v2, and it seemed to me that the conclusion was that it'll never be as fast as v1.

chrisjbryant / errant

spaCy >= 2.0 support #10