Closed: bdewilde closed this issue 5 years ago.

Is it ~~possible and~~ advisable to add a multilingual NER pipe from the `xx` model to a blank spaCy language that supports tokenization but doesn't yet have any statistical models for annotating documents? If so, what's the "correct" way to do this? I don't see anything in the docs, but I figure using language-specific tokenization is better than the multilingual defaults.

Which page or section is this issue related to? https://spacy.io/usage/processing-pipelines
Okay, it does seem to be possible —
>>> import spacy
>>> nlp_xx = spacy.load("xx_ent_wiki_sm")
>>> nlp_ar = spacy.blank("ar")
>>> nlp_ar.add_pipe(nlp_xx.get_pipe("ner"), name="ner", last=True)
>>> nlp_ar.pipeline
[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1229cf708>)]
— I'm mostly wondering if this is advisable, and if so, whether there's a better way to do it.
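For completeness, a quick smoke test of the combined pipeline would look something like this (the Arabic sentence is just an arbitrary placeholder, "Barack Obama was born in Hawaii."):

>>> doc = nlp_ar("ولد باراك أوباما في هاواي.")
>>> [(ent.text, ent.label_) for ent in doc.ents]  # any entities use the xx model's label scheme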
In theory, this is possible, yes, and your code example would be the way to do it. In this case, there are a few things to consider, though: the two `nlp` objects (`xx` and `ar`) have different vocabs, which can cause conflicts. This is especially relevant for things like the tag map and other label schemes; e.g. you might end up with an entity label that's only present in the `xx` vocab, but not in the `ar` vocab.
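A minimal way to surface that mismatch, sketched against the spaCy v2 API used above (note this only checks the string table, so it won't catch every possible vocab conflict):

>>> ner = nlp_xx.get_pipe("ner")
>>> # xx_ent_wiki_sm predicts the WikiNER labels: LOC, MISC, ORG, PER
>>> missing = [label for label in ner.labels if label not in nlp_ar.vocab.strings]
>>> for label in missing:
...     nlp_ar.vocab.strings.add(label)  # register the label string with the ar vocab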
Makes total sense, thanks Ines! I had a feeling these things might be issues, but somehow convinced myself that it was no big deal. 😅 Will use the multilingual NER pipeline as-is, end-to-end.
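(That is, something like the following sketch, where the tokenizer, vocab, and NER all come from the same `xx` package, so there's no mixing of vocabs:)

>>> import spacy
>>> nlp = spacy.load("xx_ent_wiki_sm")  # one model: shared tokenizer, vocab, and NER
>>> doc = nlp("ولد باراك أوباما في هاواي.")  # same example text as above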
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.