TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

spaCy + UDPipe

This package wraps the fast and efficient UDPipe language-agnostic NLP pipeline (via its Python bindings), so you can use UDPipe pre-trained models as a spaCy pipeline for 50+ languages out-of-the-box. Inspired by spacy-stanza, this package offers slightly less accurate models that are in turn much faster (see benchmarks for UDPipe and Stanza).

Installation

Use the package manager pip to install spacy-udpipe.

pip install spacy-udpipe

After installation, use spacy_udpipe.download() to download the pre-trained model for the desired language.

A full list of pre-trained UDPipe models for supported languages can be found in languages.json.

Usage

The loaded UDPipeLanguage class returns a spaCy Language object, i.e., the object you can use to process text and create a Doc object.

import spacy_udpipe

spacy_udpipe.download("en") # download English model

text = "Wikipedia is a free online encyclopedia, created and edited by volunteers around the world."
nlp = spacy_udpipe.load("en")

doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

As all attributes are computed once and set in the custom Tokenizer, the Language.pipeline is empty.
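
One way to check the empty pipeline is to inspect the component names on the loaded object. The sketch below is illustrative only: pipeline_component_names is a hypothetical helper (not part of spacy-udpipe), it relies on spaCy's Language.pipe_names attribute, and it assumes the model for the given language has already been downloaded.

```python
from typing import List


def pipeline_component_names(lang: str = "en") -> List[str]:
    """Return the names of the pipeline components of a loaded UDPipe model.

    Hypothetical helper for illustration; not part of spacy-udpipe.
    """
    # Imported lazily so that defining this helper does not require the package.
    import spacy_udpipe

    # Assumes spacy_udpipe.download(lang) has already been run.
    nlp = spacy_udpipe.load(lang)
    # Language.pipe_names lists the components of Language.pipeline by name.
    return nlp.pipe_names
```

Given the statement above, calling pipeline_component_names("en") would be expected to return an empty list, since the annotations are set by the tokenizer rather than by separate pipeline components.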

The type of text can be one of the following:

- unprocessed: str,
- presegmented: List[str],
- pretokenized: Union[List[str], List[List[str]]].
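
To make the input shapes concrete, here is a small sketch that builds a raw string, a list of sentence strings, and a list of token lists from the same text. The helpers presegment and pretokenize are hypothetical and use naive string splitting purely for illustration; they are not part of spacy-udpipe and do not reproduce UDPipe's own segmentation or tokenization.

```python
from typing import List


def presegment(text: str) -> List[str]:
    """Naively split raw text into sentence strings on '. ' boundaries."""
    parts = [p.strip() for p in text.split(". ") if p.strip()]
    # Restore the full stop that the split removed from non-final sentences.
    return [p if p.endswith(".") else p + "." for p in parts]


def pretokenize(sentences: List[str]) -> List[List[str]]:
    """Naively split each sentence into whitespace-delimited tokens."""
    return [s.split() for s in sentences]


# Unprocessed input: a single raw string.
unprocessed = "Wikipedia is a free online encyclopedia. It is edited by volunteers."
# Presegmented input: one string per sentence.
presegmented = presegment(unprocessed)
# Pretokenized input: one list of token strings per sentence.
pretokenized = pretokenize(presegmented)
```

Each of the three values has the shape of one accepted form of text; with a loaded model, any of them could be passed as nlp(value).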

Loading a custom model

The following code snippet demonstrates how to load a custom UDPipe model (for the Croatian language):

import spacy_udpipe

nlp = spacy_udpipe.load_from_path(lang="hr",
                                  path="./custom_croatian.udpipe",
                                  meta={"description": "Custom 'hr' model"})
text = "Wikipedija je enciklopedija slobodnog sadržaja."

doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

This can be done for any of the languages supported by spaCy. For an exhaustive list, see spaCy languages.
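
When the model path comes from configuration or user input, it may help to verify that the file exists before loading it. The following is a minimal sketch under that assumption: load_custom_model is a hypothetical wrapper (not part of spacy-udpipe) around the load_from_path call shown above, and the default description text is a placeholder.

```python
from pathlib import Path


def load_custom_model(lang: str, path: str, description: str = "Custom model"):
    """Load a custom UDPipe model after checking that the file exists.

    Hypothetical wrapper for illustration; not part of spacy-udpipe.
    """
    model_path = Path(path)
    if not model_path.is_file():
        # Fail early with a clear message instead of passing a bad path on.
        raise FileNotFoundError(f"UDPipe model file not found: {model_path}")

    # Imported lazily so the path check works without the package installed.
    import spacy_udpipe

    return spacy_udpipe.load_from_path(
        lang=lang,
        path=str(model_path),
        meta={"description": description},
    )
```

For the Croatian example above, this could be called as load_custom_model("hr", "./custom_croatian.udpipe", "Custom 'hr' model").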

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please update the tests as appropriate. Tests run automatically for each pull request against the master branch. To run the tests locally, first install the package with pip install -e .[dev], then run pytest from the root source directory as follows:

make test

Additionally, run flake8 with the following command to check for coding mistakes:

make lint

License

spacy-udpipe is released under the MIT License.

Project status

Maintained by Text Analysis and Knowledge Engineering Lab (TakeLab).

Notes