explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Building an NER pipeline for languages supported by stanza but not spacy #94

Closed rohanchn closed 1 year ago

rohanchn commented 1 year ago

Hi,

I am looking to work on an NER pipeline for Urdu. Currently, spacy doesn't support Urdu but stanza does. I was under the impression that to use spacy-stanza for a language, both libraries must support the language. But then I also saw #35, where the user seems to be using spacy-stanza for Urdu.

Could anyone here please provide some wisdom on using stanza models in spacy for languages that spacy doesn't support?

adrianeboyd commented 1 year ago

spacy does have basic Urdu language support so loading the stanza pipeline should work as described in the spacy-stanza README, just with ur instead of en:

import spacy_stanza
nlp = spacy_stanza.load_pipeline("ur")

(If the stanza language doesn't have basic support in spacy, then you can still load the stanza language as described for Coptic in the first item here: https://github.com/explosion/spacy-stanza#stanza-pipeline-options.)
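To make the two cases concrete, here is a minimal sketch, assuming the stanza models still need to be downloaded and that you already know whether spacy has basic support for the language. The helper names `stanza_load_args` and `load_stanza_pipeline` and the `has_spacy_support` flag are hypothetical and only for illustration; `stanza.download` and `spacy_stanza.load_pipeline(name, lang=...)` are the calls shown in the spacy-stanza README.

```python
def stanza_load_args(stanza_lang, has_spacy_support):
    """Pick the arguments for spacy_stanza.load_pipeline (hypothetical helper).

    If spacy has basic support for the language, its code is used directly.
    Otherwise spacy's multi-language class "xx" is used and the stanza
    language code is passed separately via `lang`.
    """
    if has_spacy_support:
        return stanza_lang, {}
    return "xx", {"lang": stanza_lang}


def load_stanza_pipeline(stanza_lang, has_spacy_support=True):
    """Download the stanza models and wrap them in a spacy pipeline."""
    # Imported lazily so the helper above can be used without these packages.
    import stanza
    import spacy_stanza

    stanza.download(stanza_lang)  # fetches the default stanza models
    name, kwargs = stanza_load_args(stanza_lang, has_spacy_support)
    return spacy_stanza.load_pipeline(name, **kwargs)


# Usage (requires stanza, spacy-stanza and network access for the download):
# nlp_ur = load_stanza_pipeline("ur")                             # Urdu
# nlp_cop = load_stanza_pipeline("cop", has_spacy_support=False)  # Coptic
```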

However it doesn't look like stanza currently has an NER model for Urdu, so you'd need to train your own NER model. If you have an annotated NER corpus, you could train a stanza NER model following the stanza docs: https://stanfordnlp.github.io/stanza/new_language_ner.html

Or you could train a spacy NER model (https://spacy.io/usage/training/#quickstart) and add this component to the nlp pipeline as an additional pipeline component with nlp.add_pipe instead. The spacy course (https://course.spacy.io/en/chapter4) and example projects (e.g., https://github.com/explosion/projects/tree/v3/pipelines/ner_demo) show how to get started with training custom spacy NER models.
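As a rough sketch of the second option, assuming you have already trained a spacy v3 NER pipeline and saved it to disk: the component can be sourced from the trained pipeline and added after the stanza-based tokenizer. The function names and the model path below are hypothetical; `nlp.add_pipe("ner", source=...)` is the standard spacy v3 way to reuse a trained component. If the NER component was trained with a shared tok2vec listener, the tok2vec component would probably need to be sourced as well.

```python
def add_trained_ner(nlp, ner_nlp, name="ner"):
    """Copy a trained NER component from `ner_nlp` into `nlp`."""
    if name in nlp.pipe_names:
        raise ValueError(f"pipeline already has a component named {name!r}")
    # `source` tells spacy to reuse the trained component instead of
    # creating a new, untrained one.
    nlp.add_pipe(name, source=ner_nlp)
    return nlp


def build_urdu_pipeline(ner_model_path):
    """Stanza pipeline for Urdu plus a separately trained spacy NER."""
    # Imported lazily; requires spacy, spacy-stanza and downloaded
    # stanza models for "ur" (e.g. via stanza.download("ur")).
    import spacy
    import spacy_stanza

    nlp = spacy_stanza.load_pipeline("ur")
    ner_nlp = spacy.load(ner_model_path)  # hypothetical path to your model
    return add_trained_ner(nlp, ner_nlp)


# Usage (the path is a placeholder for your own training output):
# nlp = build_urdu_pipeline("training/model-best")
# doc = nlp("...")
# print([(ent.text, ent.label_) for ent in doc.ents])
```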

rohanchn commented 1 year ago

This is very useful. Thank you for this!

Yes, I intend to train my own NER model.

I am closing this issue for now, and if I hit a wall, I will write again. Thanks again!