Closed: bdewilde closed this issue 5 years ago.

Is it ~~possible and~~ advisable to add a multilingual NER pipe from the `xx` model to a blank spaCy language that supports tokenization but doesn't yet have any statistical models for annotating documents? If so, what's the "correct" way to do this? I don't see anything in the docs, but I figure using language-specific tokenization is better than the multilingual defaults.

Which page or section is this issue related to? https://spacy.io/usage/processing-pipelines
Okay, it does seem to be possible —
>>> import spacy
>>> nlp_xx = spacy.load("xx_ent_wiki_sm")
>>> nlp_ar = spacy.blank("ar")
>>> nlp_ar.add_pipe(nlp_xx.get_pipe("ner"), name="ner", last=True)
>>> nlp_ar.pipeline
[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1229cf708>)]
— I'm mostly wondering if this is advisable, and if so, whether there's a better way to do it.
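For completeness, a quick smoke test of the combined pipeline would look something like this (the Arabic sentence is just an arbitrary placeholder, "Barack Obama was born in Hawaii."):

>>> doc = nlp_ar("ولد باراك أوباما في هاواي.")
>>> [(ent.text, ent.label_) for ent in doc.ents]  # any entities use the xx model's label scheme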
In theory, this is possible, yes, and your code example would be the way to do it. In this case, there are a few things to consider, though: the two `nlp` objects (`xx` and `ar`) have different vocabs, which can cause conflicts. This is especially relevant for things like the tag map and other label schemes; e.g. you might end up with an entity label that's only present in the `xx` vocab, but not in the `ar` vocab.
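A minimal way to surface that mismatch, sketched against the spaCy v2 API used above (note this only checks the string table, so it won't catch every possible vocab conflict):

>>> ner = nlp_xx.get_pipe("ner")
>>> # xx_ent_wiki_sm predicts the WikiNER labels: LOC, MISC, ORG, PER
>>> missing = [label for label in ner.labels if label not in nlp_ar.vocab.strings]
>>> for label in missing:
...     nlp_ar.vocab.strings.add(label)  # register the label string with the ar vocab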
Makes total sense, thanks Ines! I had a feeling these things might be issues, but somehow convinced myself that it was no big deal. 😅 Will use the multilingual NER pipeline as-is, end-to-end.
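(That is, something like the following sketch, where the tokenizer, vocab, and NER all come from the same `xx` package, so there's no mixing of vocabs:)

>>> import spacy
>>> nlp = spacy.load("xx_ent_wiki_sm")  # one model: shared tokenizer, vocab, and NER
>>> doc = nlp("ولد باراك أوباما في هاواي.")  # same example text as above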
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.