explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

It's sometimes difficult to initialize pipeline components in code #7027

Open honnibal opened 3 years ago

honnibal commented 3 years ago

The workflow for setting up a pipeline component in code sometimes feels a bit rough. This came up while I was investigating #6958.

Let's say we have a pipeline component that assumes its .initialize() method will be called before it's in a valid state, as the transformer does, but that doesn't necessarily need to be trained before it's functional. We have the following:


import spacy

nlp = spacy.blank("en")
transformer = nlp.add_pipe("transformer")

So now we need to call transformer.initialize(). How do we do that?

A quick improvement is to add an argument to validate_get_examples indicating whether the component can work with no examples. I'm not sure how to help components that do need some data though.
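A minimal sketch of what such an argument might look like, written as a standalone function rather than as spaCy's actual validate_get_examples. The name validate_get_examples_sketch, the allow_empty flag, and the error messages are all hypothetical and only illustrate the idea:

```python
from typing import Any, Callable, Iterable, Optional


def validate_get_examples_sketch(
    get_examples: Optional[Callable[[], Iterable[Any]]],
    method: str,
    *,
    allow_empty: bool = False,  # hypothetical flag, not part of spaCy
) -> None:
    """Check a get_examples callback, optionally accepting 'no data'."""
    if get_examples is None:
        if allow_empty:
            # The component declared that it can initialize without examples.
            return
        raise TypeError(f"{method} requires a get_examples callback")
    if not callable(get_examples):
        raise TypeError(f"{method} expected a callable, got {type(get_examples)}")
    examples = list(get_examples())
    if not examples and not allow_empty:
        raise ValueError(f"{method} needs at least one example")


# A component that needs no data would opt in explicitly:
validate_get_examples_sketch(None, "Transformer.initialize", allow_empty=True)
validate_get_examples_sketch(lambda: [], "Transformer.initialize", allow_empty=True)
```

With the flag left at its default, an empty or missing callback still raises, so components that do need data keep the current behaviour.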

Maybe some components should check whether they're initialized, and do that on first usage if necessary? It does feel dirty, though.
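For illustration, initialize-on-first-use could look roughly like the toy class below. LazyLookupComponent, its default table, and the is_initialized property are invented for this sketch and don't correspond to any real spaCy component:

```python
from typing import Dict, Optional


class LazyLookupComponent:
    """Toy component (hypothetical) that initializes itself on first use."""

    def __init__(self) -> None:
        # None means initialize() has not run yet.
        self._table: Optional[Dict[str, str]] = None

    @property
    def is_initialized(self) -> bool:
        return self._table is not None

    def initialize(self, table: Optional[Dict[str, str]] = None) -> None:
        # Fall back to a tiny built-in table when no data is supplied.
        self._table = dict(table) if table is not None else {"was": "be"}

    def __call__(self, word: str) -> str:
        if not self.is_initialized:
            # The implicit step that "feels dirty": set up state on first call.
            self.initialize()
        return self._table.get(word, word)


component = LazyLookupComponent()
print(component.is_initialized)  # False
print(component("was"))          # be
print(component.is_initialized)  # True
```

The trade-off is that the first call does hidden work and any initialization error surfaces at usage time rather than at setup time.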

adrianeboyd commented 3 years ago

The lemmatizer has the same issue with its lookup tables. It doesn't call validate_get_examples, though; it just ignores the get_examples argument, so you can call nlp.get_pipe("lemmatizer").initialize(). The warning isn't helpful if you're substituting a lemmatizer into an existing pipeline, because it tells you to call nlp.initialize(), which would wipe out everything else.

Why is the transformer validating the examples if it's not using them?