explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

adding multilingual ner to blank spacy language class #3765

Closed by bdewilde 5 years ago

bdewilde commented 5 years ago

Is it ~possible and~ advisable to add the multilingual NER pipe from the xx model to a blank spaCy language class that supports tokenization but doesn't yet have any statistical models for annotating documents? If so, what's the "correct" way to do this? I don't see anything in the docs, but I figure that language-specific tokenization is better than the multilingual defaults.

Which page or section is this issue related to?

https://spacy.io/usage/processing-pipelines

bdewilde commented 5 years ago

Okay, it does seem to be possible:

>>> import spacy
>>> nlp_xx = spacy.load("xx_ent_wiki_sm")
>>> nlp_ar = spacy.blank("ar")
>>> nlp_ar.add_pipe(nlp_xx.get_pipe("ner"), name="ner", last=True)
>>> nlp_ar.pipeline
[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1229cf708>)]

I'm mostly wondering whether this is advisable and, if so, whether there's a better way to do it.

ines commented 5 years ago

In theory, this is possible, yes, and your code example would be the way to do it. In this case, there are a few things to consider, though:

bdewilde commented 5 years ago

Makes total sense, thanks Ines! I had a feeling these things might be issues, but somehow convinced myself that it was no big deal. 😅 Will use the multilingual NER pipeline as-is, end-to-end.
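For anyone landing here later, a minimal sketch of what "as-is, end-to-end" could look like, assuming spaCy and the `xx_ent_wiki_sm` package are installed. The helper names `load_multilingual_ner` and `extract_entities` are illustrative and not part of spaCy; only `spacy.load`, `doc.ents`, `ent.text` and `ent.label_` are spaCy's own.

```python
def load_multilingual_ner(model_name="xx_ent_wiki_sm"):
    """Load the multilingual model whole, keeping its own tokenizer and NER together.

    Hypothetical helper: spaCy is imported lazily so that the function
    below can be used (and tested) without spaCy installed.
    """
    import spacy

    return spacy.load(model_name)


def extract_entities(nlp, text):
    """Run the pipeline end-to-end and return (entity text, label) pairs.

    `nlp` is any callable that returns an object with an `ents` attribute,
    which is what a loaded spaCy pipeline does.
    """
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]
```

Calling `extract_entities(load_multilingual_ner(), some_text)` tokenizes and tags the text with the same pipeline, rather than pairing the NER component with a tokenizer from a different language class.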
