dnm1977 closed this issue 3 years ago
This is the right place, it's definitely a spacy issue.
The pipeline component models are entirely separate in spacy v2, so this should be possible, but it looks like there are a few bugs here. The `--use-chars` option should probably have an "experimental" label because it's not really ready for production use in v2. It was developed with the morphologizer in mind (which is also only partially implemented in v2), so the older components haven't been updated to reflect all the newer options, and since we don't use it in any of the models we train internally, it hasn't been tested thoroughly. The problems:

In the `spacy train` code, the `--use-chars` option looks buggy. I don't think it actually enables the character embedding; it just removes the subword features like prefix/suffix. It's also just a few lines to fix. Here's a branch where it should be working if you want to test it: https://github.com/adrianeboyd/spaCy/tree/example/use-chars-v2.3.3
You should be able to check this branch out and just run `pip install .` to install it in your current venv.
Overall, I wouldn't recommend using anything with the character embeddings in production for v2. It is working correctly in the upcoming spacy v3 (we've tested it a lot with the now fully-implemented morphologizer), so you can try that out now in `spacy-nightly` if you'd like.
Hey, thanks, Adriane.
Trying this out, though, I get the following (can't even `import spacy`):
```
$ pwd
/Users/dennismehay/Documents/spacy_experimental/tmp
$ git clone https://github.com/adrianeboyd/spaCy
[...]
$ cd spaCy
$ git checkout "example/use-chars-v2.3.3"
$ git branch
* example/use-chars-v2.3.3
  master
$ python -m venv ~/.virtualenvs/spacy_char_experimental
$ source ~/.virtualenvs/spacy_char_experimental/bin/activate
(spacy_char_experimental)$ pip install .
[...all goes well...]
(spacy_char_experimental)$ python -c "import spacy"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/dennismehay/Documents/spacy_experimental/tmp/spaCy/spacy/__init__.py", line 12, in <module>
    from . import pipeline
  File "/Users/dennismehay/Documents/spacy_experimental/tmp/spaCy/spacy/pipeline/__init__.py", line 4, in <module>
    from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker
ModuleNotFoundError: No module named 'spacy.pipeline.pipes'
```
Also, unfortunately, I'm stuck with spaCy v2, at least until the Allen AI institute retrains ScispaCy to be spaCy 3.x compliant, or they release their data so I can retrain everything from scratch. (I know the datasets are mostly open-access -- except for Ontonotes and maybe others? -- but re-munging all of that data is not budgeted for at the moment, unfortunately.)
It's trying to import spacy from the source directory rather than the installed package. Just change to a different directory before running `import spacy` and I think it should work?
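For anyone hitting the same thing: the local checkout shadows the installed package because CPython puts the current/script directory at the front of `sys.path`, so a `./spacy/` source tree wins over site-packages. A minimal sketch of the mechanism, using a throwaway package name (`shadowed_pkg` is hypothetical, standing in for `spacy`):

```python
import pathlib
import sys
import tempfile

# Build a throwaway package in a temp directory to simulate a source
# checkout. (Hypothetical name; the real case is ./spacy/ vs site-packages.)
tmp = pathlib.Path(tempfile.mkdtemp())
pkg = tmp / "shadowed_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("ORIGIN = 'local checkout'\n")

# Imports search sys.path in order, and Python puts the current directory
# (or the script's directory) first -- so importing from inside the spaCy
# source tree finds the un-built local package instead of the installed one.
sys.path.insert(0, str(tmp))
import shadowed_pkg

print(shadowed_pkg.ORIGIN)  # -> local checkout
```

Changing to any other directory (as suggested above) removes the shadowing entry, so the installed package is found instead.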
And the cleaner version of this (against the current `master` rather than a few bug fixes ago) is in the PR linked above, #6441. As I mention in the comments, I'm not sure we're going to make this change because it's not really supported well enough.
Ah, heh. Yeah, about 20 minutes and a mug of coffee after I posted, I realized this. (Oh, Python's preference for local directories over installed packages. I'm sure there's a reason.)
[edit] The change would be useful to us, possibly, since we're using ScispaCy, and (as I mentioned above), their models are all on spaCy v2.x. ScispaCy is crucial for sentence splitting, tokenization, tagging and parsing, even if their NER tagging is not 100% what we would like it to be (in terms of how and what they tag, not necessarily how well the models perform formally speaking).
So this worked pretty well in terms of accuracy. Model performance is up from a P/R/F1 of 77/79/76 to 81/79/80 on the dev set with no change in hyperparameters. Test set performance is up from a P/R/F1 of 77/73/75 to 79/78/78.
[edit: This is on the BioCreative II Gene Mention dataset with the same train/dev/test split as here: https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/BC2GM-IOB, but removing any sentence that occurs in two or more splits (about 40 sentences removed), and with the original character-level annotations projected onto spaCy tokenization, not the tokenization used at MTL-Bioinformatics-2016.]
I think even if GPUs don't support this, spaCy is fast enough on CPUs (orders of magnitude faster than, e.g., Stanza on GPUs -- cf. https://arxiv.org/pdf/2003.07082.pdf) to make it worth it. Placing all of the guardrails in the documentation would be the challenge.
The GPU issue is more that the resulting models are unexpectedly brittle than anything about speed, since spacy has mainly been optimized for CPU performance anyway. We don't have any way to indicate that models are CPU-only, and I don't think that's something we'd want to try to add, either.
There are also some parameters (`nM` and `nC`) for the character embeddings that you currently can't set easily. It's just not really ready for use in v2, and everything is much, much better in v3; hopefully there will be updated scispacy models at some point after the final v3.0 release.
Hey @adrianeboyd, could you point me in the direction of how to use this char-embedding feature in the nightly version? I would like to give it a try. Thanks!
https://nightly.spacy.io/api/architectures#CharacterEmbed
https://nightly.spacy.io/usage/layers-architectures#sublayers
(Hmm, there's really no good way to search the nightly docs; that doesn't make things easy to find...)
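To make the pointers above more concrete, here is a sketch of what the relevant embedding block of a v3 training config might look like, assuming the `spacy.CharacterEmbed.v1` architecture from spacy-nightly. The numeric values are illustrative only, not recommendations; see the CharacterEmbed docs linked above for the actual parameters and defaults:

```ini
# Illustrative fragment of a spacy v3 training config: swap the tok2vec
# embedding sublayer for character embeddings. nM is the dimension of the
# character embeddings; nC is the number of characters embedded per word.
[components.tok2vec.model.embed]
@architectures = "spacy.CharacterEmbed.v1"
width = 128
rows = 7000
nM = 64
nC = 8
```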
@adrianeboyd, is there an approximate estimated release date for spaCy v3?
No, sorry, we'll wait until we think it's ready.
We've decided to remove this option from the CLI/docs for the next release of v2 since it's not stable enough.
This issue has been automatically closed because it was answered and there was no follow-up discussion.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Recent work -- (Gu, et al., 2020; "PubMedBERT") and (Veysel and Talby, 2020) -- seems to show that OOVs (writ large, meaning whatever your word-level vocabulary is or is not) are detrimental to NER performance, at least in biomedical domains (e.g., gene and protein tagging). The latter publication uses a character CNN feeding an LSTM and outperforms the state of the art -- including Stanza and BERT-derived models such as PubMedBERT.
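To illustrate why character-level features help with OOVs: even for a token never seen in training, its character n-grams overlap heavily with those of seen vocabulary, so the model still gets informative features. A toy sketch in plain Python (not spaCy's actual implementation; the gene names are just hypothetical examples):

```python
def char_ngrams(token: str, n: int = 3) -> set:
    """Character n-grams with boundary markers, a common way to give
    a model subword evidence for out-of-vocabulary tokens."""
    padded = f"<{token}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

# Even if "BRCA2" never appeared in training, it shares most of its
# character trigrams with "BRCA1", so character-level features can
# generalize where a purely word-level vocabulary cannot.
seen, unseen = char_ngrams("BRCA1"), char_ngrams("BRCA2")
overlap = len(seen & unseen) / len(seen | unseen)
print(f"n-gram overlap: {overlap:.2f}")  # -> n-gram overlap: 0.43
```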
With that in mind, I tried to retrain just a NER model using the ScispaCy^^ base models (`en_core_sci_md-0.3.0`), and I got the following error when turning on the `--chr`/`--use-chars` flag.

The training command:
This gives the following error. (Sorry I can't copy the whole thing; I'm visually copying and typing from screen to screen at the moment.)
This happened when the model was reloaded (presumably) from the ground up -- parsing, tagging, NER, etc. -- to run on the validation/dev set after the first iteration.
Note that the model trains fine for several iterations and saves a working model if I omit the `--chr`/`--use-chars` flag. This is doubtless because the models have tied parameters and there is no char CNN component in the Tok2Vec features for any of the other parts of the pipeline (tagging and parsing). I don't want to retrain the parser and tagger to use char CNN features, so is there a workaround?

(^^Perhaps I should crosspost there, but this seems to be a spaCy issue -- something about model component mismatch.)
Your Environment