explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Retrain JUST the NER component to have character CNN features? #6432

Closed dnm1977 closed 3 years ago

dnm1977 commented 3 years ago

Recent work -- (Gu, et al., 2020; "PubMedBERT") and (Veysel and Talby, 2020) -- seems to show that OOVs (writ large, meaning what your word-level vocabulary is or is not) are detrimental to NER performance, at least in bio-medical domains (gene and protein tagging, e.g.). The latter publication above uses a character CNN feeding an LSTM and outperforms the state of the art -- including Stanza and BERT-derived models, such as PubMedBERT.

With that in mind, I tried to retrain just a NER model using the ScispaCy^^ base models (en_core_sci_md-0.3.0), and I got the following error when turning on the --chr/--use-chars flag.

The training command:

$ python -m spacy train en /path/to/output/dir /path/to/train.json /path/to/dev.json --base-model 'en_core_sci_md' --pipeline ner -R -v 'en_core_sci_md' -ne 2 --meta-path /path/to/model/en_core_sci_md/meta.json --chr

This gives the following error. (Sorry I can't copy the whole thing; I'm visually copying and typing from screen to screen at the moment.)

...
ValueError [E149] Error deserializing model. Check that the config used to create the component matches the model being loaded.
...

This happened when the model was reloaded (presumably) from the ground up -- parsing, tagging, NER, etc. -- to run on the validation/dev set after the first iteration.

Note that the model trains fine, for several iterations, and saves a working model if I omit the --chr/--use-chars flag. This is doubtless because the models have tied parameters and there is no char CNN component to the Tok2Vec features for any of the other parts of the whole pipeline (tagging and parsing). I don't want to retrain the parser and tagger to use char CNN features, so is there a workaround?

(^^Perhaps I should crosspost there, but this seems to be a spaCy issue -- something about model component mismatch.)

adrianeboyd commented 3 years ago

This is the right place, it's definitely a spacy issue.

The pipeline component models are entirely separate in spacy v2 so this should be possible, but it looks like there are a few bugs here. The --use-chars option probably should have an "experimental" label on it because it's not really ready for production use in v2. It was developed with the morphologizer in mind (which is also only partially implemented in v2), so the older components haven't been updated to reflect all the newer options, and since we don't use it in any of the models we train internally, it hasn't been tested thoroughly. The problems:

Here's a branch where it should be working if you want to test it: https://github.com/adrianeboyd/spaCy/tree/example/use-chars-v2.3.3

You should be able to check this branch out and just run pip install . to install it in your current venv.

Overall I think I wouldn't recommend using anything with the character embeddings in production for v2. It is working correctly in the upcoming spacy v3 (we've tested it a lot with the now fully-implemented morphologizer) so you can try that out now in spacy-nightly if you'd like.

dnm1977 commented 3 years ago

Hey, thanks, Adriane.

Trying this out, though, I get the following (can't even import spacy):

$ pwd
/Users/dennismehay/Documents/spacy_experimental/tmp
$ git clone https://github.com/adrianeboyd/spaCy
[...]
$ cd spaCy
$ git checkout "example/use-chars-v2.3.3"
$ git branch
* example/use-chars-v2.3.3
  master
$ python -m venv ~/.virtualenvs/spacy_char_experimental
$ source ~/.virtualenvs/spacy_char_experimental/bin/activate
(spacy_char_experimental)$ pip install .
[...all goes well...]
(spacy_char_experimental)$ python -c "import spacy"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/dennismehay/Documents/spacy_experimental/tmp/spaCy/spacy/__init__.py", line 12, in <module>
    from . import pipeline
  File "/Users/dennismehay/Documents/spacy_experimental/tmp/spaCy/spacy/pipeline/__init__.py", line 4, in <module>
    from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker
ModuleNotFoundError: No module named 'spacy.pipeline.pipes'

dnm1977 commented 3 years ago

Also, unfortunately, I'm stuck with spaCy v2, at least until the Allen Institute for AI retrains ScispaCy to be spaCy 3.x compliant, or they release their data so I can retrain everything from scratch. (I know the datasets are mostly open-access -- except for OntoNotes and maybe others? -- but re-munging all of that data is not budgeted for at the moment, unfortunately.)

adrianeboyd commented 3 years ago

It's trying to import spacy from the source directory rather than the installed package. Just change to a different directory before running import spacy and I think it should work?
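A quick way to confirm which copy Python would pick up is to ask the import system where it resolves the package from. This is only a minimal sketch using the standard library; `module_origin` and `is_under` are hypothetical helper names, not part of spaCy.

```python
import importlib.util
import os


def module_origin(name):
    """Return the file Python would load for top-level module `name`, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def is_under(path, base):
    """True if `path` lies inside directory `base` (after resolving symlinks)."""
    path = os.path.realpath(path)
    base = os.path.realpath(base)
    return os.path.commonpath([path, base]) == base


if __name__ == "__main__":
    origin = module_origin("spacy")
    if origin is None:
        print("spacy is not importable from here")
    elif is_under(origin, os.getcwd()):
        # The source checkout in the current directory is shadowing the install.
        print("spacy resolves to the current directory:", origin)
    else:
        print("spacy resolves to:", origin)
```

With `python -c` or an interactive session, the current directory comes first on `sys.path`, which is why a `spacy/` folder in the working directory wins over the package in the venv. A script file puts its own directory first instead, so the check is most meaningful when run from the directory in question.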

adrianeboyd commented 3 years ago

And the cleaner version of this (against the current master rather than a few bug fixes ago) is in the PR linked above, #6441. As I mention in the comments, I'm not sure we're going to make this change because it's not really supported well enough.

dnm1977 commented 3 years ago

Ah, heh. Yeah, about 20m and a mug of coffee after I posted, I realized this. (Oh, Python's preference for local directories over installed packages. I'm sure there's a reason.)

[edit] The change would be useful to us, possibly, since we're using ScispaCy, and (as I mentioned above), their models are all on spaCy v2.x. ScispaCy is crucial for sentence splitting, tokenization, tagging and parsing, even if their NER tagging is not 100% what we would like it to be (in terms of how and what they tag, not necessarily how well the models perform formally speaking).

dnm1977 commented 3 years ago

So this worked pretty well, in terms of accuracy. The model performance is up from a P/R/F1 of 77/79/76 to 81/79/80 on the dev set with no change in hyperparameters. The test set performance is up from a P/R/F1 of 77/73/75 to 79/78/78.
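For reference, F1 is the harmonic mean of precision and recall. The sketch below assumes the standard balanced definition; the thread does not say how the scores were averaged or rounded, so recomputing from the rounded percentages will not necessarily match every quoted figure. `f1_score` is a hypothetical helper name.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Dev-set precision and recall with character features, as quoted above.
dev_f1 = f1_score(81, 79)
print(round(dev_f1))
```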

[edit: This is on the BioCreative II Gene Mention dataset with the same train/dev/test split as here https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/BC2GM-IOB, but removing any sentence that occurs in two or more splits (about 40 or so sentences removed), and with the original char-level annotations projected onto spaCy tokenization, not the tokenization that they use at MTL-Bioinformatics-2016]
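The two preprocessing steps described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual script used: sentences are compared as exact strings, a sentence found in two or more splits is dropped from all of them, and a character span is expanded to every token it overlaps. `drop_cross_split_sentences` and `project_span` are hypothetical names.

```python
from collections import Counter


def drop_cross_split_sentences(splits):
    """Remove any sentence that occurs in two or more splits.

    `splits` maps a split name to a list of sentence strings.
    """
    seen_in = Counter()
    for sentences in splits.values():
        # Count each sentence once per split, even if repeated within it.
        for text in set(sentences):
            seen_in[text] += 1
    return {
        name: [s for s in sentences if seen_in[s] < 2]
        for name, sentences in splits.items()
    }


def project_span(token_offsets, start, end):
    """Map a character span [start, end) onto token indices.

    `token_offsets` is a list of (tok_start, tok_end) character offsets.
    Returns (first_token, last_token_exclusive), or None if the span
    overlaps no token. Partial overlaps are expanded to whole tokens.
    """
    covered = [
        i for i, (tok_start, tok_end) in enumerate(token_offsets)
        if tok_start < end and tok_end > start
    ]
    if not covered:
        return None
    return covered[0], covered[-1] + 1


if __name__ == "__main__":
    splits = {"train": ["a", "b"], "dev": ["b", "c"], "test": ["d"]}
    print(drop_cross_split_sentences(splits))

    # Offsets for the tokens of "the p53 gene".
    offsets = [(0, 3), (4, 7), (8, 12)]
    print(project_span(offsets, 4, 7))
```

With spaCy, the offsets could be built as `[(t.idx, t.idx + len(t)) for t in doc]`. Expanding to whole tokens is one choice among several; dropping spans that don't line up with token boundaries would be a stricter alternative.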

I think even if GPUs don't support this, spaCy is fast enough on CPUs (orders of magnitude faster than, e.g., Stanza on GPUs -- cf. https://arxiv.org/pdf/2003.07082.pdf) to make it worth it. Placing all of the guardrails in the documentation would be the challenge.

adrianeboyd commented 3 years ago

The GPU issue is less about speed than about the resulting models being unexpectedly brittle, since spacy has mainly been optimized for CPU performance anyway. We don't have any way to indicate that models are CPU-only, and I don't think that's something we'd want to add, either.

There are also some parameters (nM and nC) for the character embeddings that you currently can't set easily. It's just not really ready for use in v2 and everything is much much better in v3, and hopefully there will be updated scispacy models at some point after the final v3.0 release.

fcggamou commented 3 years ago

Hey @adrianeboyd, could you point me in the direction of how to use this char-embedding feature in the nightly version? I would like to give it a try. Thanks!

adrianeboyd commented 3 years ago

https://nightly.spacy.io/api/architectures#CharacterEmbed

https://nightly.spacy.io/usage/layers-architectures#sublayers

(Hmm, there's really no good way to search the nightly docs, that doesn't make things easy to find...)

dnm1977 commented 3 years ago

@adrianeboyd, is there an approximate estimated release date for spaCy v3?

adrianeboyd commented 3 years ago

No, sorry, we'll wait until we think it's ready.

adrianeboyd commented 3 years ago

We've decided to remove this option from the CLI/docs for the next release of v2 since it's not stable enough.

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.
