explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

💫 Proposal: A component-based processing pipeline architecture via Doc._, Token._ and Span._ #1381

Closed ines closed 6 years ago

ines commented 6 years ago

Related issues: #1085, #1105, #860 CC: @honnibal, @Liebeck, @christian-storm

Motivation

Custom processing pipelines are a very powerful feature of spaCy that can solve many of the problems people currently run into when making NLP work for their specific use case. For spaCy v2.0, we've therefore been working on improving the architecture and extensibility of the processing pipelines. Fundamentally, a pipeline is a list of functions called on a Doc in order. The pipeline can be set by a model and modified by the user. A pipeline component can be a complex class that holds state, or a very simple Python function that adds something to a Doc and returns it. However, even with the currently proposed improvements, the pipelines still aren't perfect, nor as user-friendly as they should be.

If it's easier to write custom data to the Doc, Token and Span, applications using spaCy will be able to take full advantage of the built-in data structures and the benefits of Doc objects as the single source of truth containing all information. Instead of mixing Doc objects, arrays, plain text and other structures, applications could simply pass around Doc objects and read from and write to them whenever necessary.

Having a straightforward API for custom extensions and a clearly defined input/output (Doc in, Doc out) also helps make larger code bases more maintainable, and allows developers to share their extensions with others and test them reliably. This is relevant for teams working with spaCy, but also for developers looking to publish their own packages, extensions and plugins.

The spaCy philosophy has always been to focus on providing one, best-possible implementation, instead of adopting a "broad church" approach, which makes a lot of sense for research libraries, but can be potentially dangerous for libraries aimed at production use. Going forward, I believe the best future-proof strategy is to direct our efforts at making the processing pipeline more transparent and extensible, and encouraging a community ecosystem of spaCy components to cover any potential use case – no matter how specific. Components could range from simple extensions adding fairly trivial attributes for convenience, to complex models making use of external libraries such as PyTorch, scikit-learn and TensorFlow.

There are many components users may want, and we'd love to be able to offer more built-in pipeline components shipped with spaCy (e.g. SBD, SRL, coref, sentiment). But there's also a clear need for making spaCy extensible for specific use cases, making it interoperate better with other libraries, and putting all of it together to update and train statistical models (the other big issue we're tackling with v2.0).

TL;DR

Why ._?

Letting the user write to a ._ attribute instead of to the Doc directly keeps a clearer separation and makes it easier to ensure backwards compatibility. For example, if you've implemented your own .coref property and spaCy claims it one day, it'll break your code. Similarly, as we have more and more production users with sizable code bases, this solution will make it much easier to tell what's built-in and what's custom. Just by looking at the code, you'll immediately know that doc.sentiment is spaCy, and doc._.sent_score isn't.

Doc._ is shorter and more distinct than Doc.user_data, and for the lack of better options in Python, the _ seems like the best choice. (It's also kinda cute... doc._.doc... once you see the face, you can't unsee it. Just like I'll never be able to read doc.cats as "doc dot categories" again 😺)
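For example, it's immediately clear which of the following is a built-in attribute and which one was added by an extension:

doc.sentiment     # built-in spaCy attribute
doc._.sent_score  # custom attribute set by an extension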

Custom pipeline components

Pipeline components can write to a Doc, Span or Token's _ attribute, which is resolved internally via an Underscore class. In the case of Span and Token, this means resolving it relative to the respective indices, as they are only views of a Doc. A pipeline component can hold any state, take the shared Vocab if needed, and implement its own getters and setters.

A component added to the pipeline needs to be a callable that takes a Doc, modifies it and returns it. Here's a simple example of a component wrapper that takes arbitrary settings and assigns "something" to a Doc and Token:

class Something(object):
    name = 'something'

    def __init__(self, vocab, **kwargs):
        self.vocab = vocab                            # shared vocab, if needed
        self.lookup = kwargs.get('lookup_table', {})  # arbitrary settings

    def __call__(self, doc):
        doc._.has_something = False
        for token in doc:
            if token.text in self.lookup:
                # write the looked-up value to the token and flag the doc
                token._.something = self.lookup[token.text]
                doc._.has_something = True
        return doc

The custom component could then be initialised and used like this:

from spacy_something import Something
something = Something(nlp.vocab, lookup_table=my_table)
nlp.add_pipe(something, before='ner')

add_pipe() would offer a more convenient way of adding to the pipeline than pipeline.append() or overwriting the pipeline, which easily gets messy, as you have to know the names and order of components, or at least the index at which to insert the new component. The before and after keyword arguments can specify one or more IDs to insert the component before/after (which will be resolved accordingly, and raise an error if the positioning is impossible).
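For illustration, here's a rough sketch of how the before/after arguments could be resolved to an insertion index. The helper name resolve_position and its error messages are made up for this example and are not part of the proposal:

def resolve_position(pipe_names, before=None, after=None):
    # pipe_names is assumed to be the ordered list of component names
    if before is not None and after is not None:
        raise ValueError("Specify either 'before' or 'after', not both")
    if before is not None:
        names = [before] if isinstance(before, str) else list(before)
        found = [pipe_names.index(name) for name in names if name in pipe_names]
        if not found:
            raise ValueError("Can't find component(s) %r to insert before" % names)
        return min(found)
    if after is not None:
        names = [after] if isinstance(after, str) else list(after)
        found = [pipe_names.index(name) for name in names if name in pipe_names]
        if not found:
            raise ValueError("Can't find component(s) %r to insert after" % names)
        return max(found) + 1
    return len(pipe_names)  # default: append at the end of the pipeline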

When the pipeline is applied, the custom attribute is available via ._:

doc = nlp(u"A text that contains words in the lookup table")
doc._.has_something
# True

This system would also allow adding custom Doc, Token and Span methods, similar to the built-in similarity().
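As a rough sketch of what a method-style extension could look like – assuming callables can be assigned to ._ just like attributes; the component and its overlap() method are made up for this example:

class OverlapChecker(object):
    name = 'overlap_checker'

    def __call__(self, doc):
        def overlap(other_doc):
            # fraction of this doc's token texts that also occur in other_doc
            texts = set(token.text for token in doc)
            other_texts = set(token.text for token in other_doc)
            return len(texts & other_texts) / max(len(texts), 1)
        doc._.overlap = overlap
        return doc

# once the component has run:
# doc._.overlap(other_doc)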

A model can either require the component package as a dependency, or ship the component code as part of the model package. It can then be added to the pipeline in the model's __init__.py:

from pathlib import Path
from spacy.util import get_model_meta, get_lang_class
from spacy_something import Something

def load(**overrides):
    # read the meta.json shipped with the model package
    meta = get_model_meta(Path(__file__).parent)
    cls = get_lang_class(meta['lang'])
    nlp = cls(pipeline=meta.get('pipeline', True), meta=meta, **overrides)
    # add the custom component before the entity recognizer
    something = Something(nlp.vocab)
    nlp.add_pipe(something, before='ner')
    return nlp.from_disk('/model_data')

Alternatively, a trainable and fully serializable custom pipeline component could also be implemented via the Pipe base class, which is used for spaCy's built-in pipeline components like the tagger, parser and entity recognizer in v2.0.

Going forward, we can even take this architecture one step further and allow other applications to register spaCy pipeline components via entry points, which would make them available by name.
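For example, a third-party package could declare its component via a setuptools entry point in its setup.py. This is only a sketch – the entry point group name 'spacy_components' is an assumption and would have to be defined by spaCy:

from setuptools import setup

setup(
    name='spacy-something',
    version='0.1.0',
    packages=['spacy_something'],
    entry_points={
        # hypothetical group name, for illustration only
        'spacy_components': [
            'something = spacy_something:Something',
        ],
    },
)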

New classes, methods and properties

Language.pipe_names (property)

Returns a list of pipeline component IDs in order. Useful to check the current pipeline, and determine where to insert custom components.

nlp = spacy.load('en_core_web_sm')
nlp.pipeline
# [<spacy.pipeline.Tensorizer>, <spacy.pipeline.Tagger>, <spacy.pipeline.DependencyParser>, <spacy.pipeline.EntityRecognizer>]
nlp.pipe_names
# ["tensorizer", "tagger", "parser", "ner"]

Language.add_pipe (method)

Add a component to the pipeline.

Argument | Type | Description
component | callable | Takes a doc, modifies it and returns it.
name | unicode | Optional component name. Defaults to component.name.
before | unicode / list | ID(s) of pipeline component(s) to insert the component before. Raises error if impossible.
after | unicode / list | ID(s) of pipeline component(s) to insert the component after. Raises error if impossible.

nlp = spacy.load('en')
nlp.add_pipe(custom_component, before='ner')

Language.replace_pipeline (method)

Replace the pipeline.

Argument | Type | Description
pipeline | list | List of pipeline components, either built-in / registered IDs or components.

nlp.replace_pipeline(['tensorizer', custom_component, 'ner'])

Underscore (class)

Resolves Doc._, Span._ and Token._ set by the user.
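To make this more concrete, here's a minimal sketch of how such a class could work, assuming custom values live in a dict owned by the Doc (like the existing user_data) and Token/Span views are resolved via their indices. This is purely illustrative, not the actual implementation:

class Underscore(object):
    def __init__(self, data, start=None, end=None):
        # bypass our own __setattr__ for internal bookkeeping fields
        object.__setattr__(self, '_data', data)
        object.__setattr__(self, '_start', start)
        object.__setattr__(self, '_end', end)

    def __getattr__(self, name):
        key = (name, self._start, self._end)
        if key not in self._data:
            raise AttributeError(name)
        return self._data[key]

    def __setattr__(self, name, value):
        self._data[(name, self._start, self._end)] = value

# usage, with a plain dict standing in for doc.user_data:
shared = {}
doc_u = Underscore(shared)          # doc._
token_u = Underscore(shared, 3, 4)  # token._ for the token at index 3
doc_u.has_something = True
token_u.something = 'value'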

The pipeline component ecosystem

The processing pipeline outlined in this proposal is a good fit for a component-based ecosystem, as pipeline components would have the following features: a lifecycle, an isolated scope and a standardised API.

Component-based ecosystems can be very powerful in driving forward community contributions, while at the same time, keeping the core library focussed and compact. We're obviously happy to integrate third-party components into the core if they're a good fit, but we also want developers to be able to take ownership of their extensions, write spaCy wrappers for their libraries and implement any logic they need quickly, without having to worry about the grand scheme of things.

If you're the maintainer of a library and want to integrate it with spaCy, you'd be able to offer a simple pipeline component your users could plug in and use. Your installation instructions would be as simple as: Install the package, initialise it with your settings and add it to your pipeline using nlp.add_pipe(). Your extension can claim its own ._ namespace on the Doc, Token and Span.

doc._.my_extension
doc._.my_extension_property
doc._.my_extension.some_property
doc._.my_extension.compare(other_doc)

Production users with large code bases would be able to manage their spaCy extensions and utilities as packages that can be developed and integrated into CI workflows independently.

Aside from the obvious use case of implementing models and missing text processing features, there are many other, creative ways in which pipeline component extensions can be utilised – for example:

In terms of the community strategy around this, a possible approach could be:

christian-storm commented 6 years ago

This is incredible and will solve a lot of problems. A lot of great thinking and design decisions. A couple reactions and questions formed by thinking about how I'd apply this to some use cases.

As it stands, one can only add to the pipe, e.g., there isn't an nlp.disable_pipe or nlp.replace_pipe. How would one solve the use case of replacing the tokenizer and still have the rest of the default spaCy pipeline intact? Like this:

nlp = spacy.load('en')
nlp.replace_pipeline(['custom_tokenizer', 'tensorizer', 'tagger', 'parser', 'ner'])

In general I have to say the current state of this pipeline business is confusing. spacy.load() creates a pipeline. So does Language(). Then of course there is Language().from_disk(). Furthermore, it has always confused me why Language was given as the name for the pipeline. Pipelines IMHO should be at the spacy level not Language level.

To me the lifecycle of a pipeline should be as simple as nlp = spacy.load_pipeline() and nlp = spacy.deserialize_pipeline() to load a working pipeline from disk or memory, nlp.train_pipeline() and nlp.tune_pipeline() for training and fine-tuning, and nlp.save_pipeline() and nlp.serialize_pipeline() for saving a pipeline to disk or memory, e.g. to store it in a cache. I suppose you should have nlp.replace_pipeline() too, but that is essentially just nlp = spacy.load_pipeline() or some combination of nlp.add_pipe(), nlp.replace_pipe() and/or nlp.disable_pipe(), right?

This system would also allow adding custom Doc, Token and Span methods, similar to the built-in similarity().

Does that mean user_hooks and its ilk would (hopefully) be jettisoned?

I really like ._ for the fact that it lives at the same level as spaCy's variables. As you well know, it flies in the face of PEP 8 though – a bit of mental dissonance and a cause for confusion for those of us who abide by that rule. Also, it still allows for collisions amongst extensions. Why not just use a transparent .my_extension namespace, similar to Python's name mangling with __var to prevent namespace collisions? It seems some clever use of __setattr__ and __getattr__ and call signatures could be used for the routing.

ines commented 6 years ago

Thanks for your feedback!

How would one solve for the use case of replacing the tokenizer and still have the rest of default spacy pipeline intact?

The tokenizer is a "special" pipeline component in the sense that it takes a different input – text. That's also the reason it's not part of the pipeline list – conceptually, pipeline components should always be swappable (at least in theory), and since the tokenizer is more of a pre-processor that creates the first Doc object for the pipeline, it gets a bit of special treatment here. However, it's possible to simply overwrite it with a custom function:

nlp.tokenizer = MyCustomTokenizer(nlp.vocab)

We do think that there's a point in keeping those things simple. If something is an object you can overwrite, you should be able to do so. Just like you'll still be able to append stuff to nlp.pipeline – although the additional methods like add_pipe are justified here, since they actually offer additional helpers like the before/after argument, which otherwise would be pretty annoying.

As it stands one can only add to the pipe, e.g., there isn't nlp.disable_pipe or nlp.replace_pipe.

Good point – didn't add this in my proposal, but those methods should definitely exist. Similarly, there should probably be a get_pipe that lets you get a component by its name.

In general I have to say the current state of this pipeline business is confusing. spacy.load() creates a pipeline. So does Language(). Then of course there is Language().from_disk(). Furthermore, it has always confused me why Language was given as the name for the pipeline. Pipelines IMHO should be at the spacy level not Language level.

I hope we'll be able to make this less confusing in the new documentation! I think a lot of the design around this comes down to how the models work under the hood, and how we've been moving towards making them more transparent (since there are now many different models with different features and trade-offs instead of just one "the model").

It probably also doesn't help that the nlp object, i.e. an instance of Language, is sometimes referred to as "the processing pipeline", when in fact, it's the container that holds the language data, access to the model's weights and a reference to a processing pipeline.

A model = weights (binary data) + pipeline + language data. The pipeline applied when you call nlp (i.e. the Language instance) on a text often depends on the model, which is why model packages can now define a list of pipeline component names in their meta.json. spacy.load() puts this all together by returning an instance of Language that holds the language data, access to the model data and a reference to the pipeline to apply. So essentially, spacy.load() is a convenience wrapper for:

cls = util.get_lang_class(lang)   # look up the Language subclass for the model's language
nlp = cls(pipeline=pipeline)      # create it with the pipeline to apply
nlp.from_disk(model_data_path)    # load the model data from disk

There's actually very little going on at the spacy level except for convenience methods. The Language instance you assign to nlp and pass around the application is what matters and holds state.

Does that mean user_hooks and its ilk would (hopefully) be jettisoned?

I'm personally not a huge fan of the user_hooks, but I also see @honnibal's point that it makes sense to let the user overwrite some built-in methods. The built-in implementation of the similarity() method is fairly arbitrary and, just like the vectors, it's one of those cases where it's pretty hard to decide on one ideal, general-purpose implementation. That was the original motivation for the user_hooks. However, the hooks will follow the same API as pipeline components, so maybe there is a better way to solve this going forward. I wouldn't mind if users simply transitioned to setting those things via their own ._ methods instead (even if we keep support for the user_hooks mechanism).

As you well know, it flies in the face of PEP though. A bit of mental dissonance and cause for confusion for those of us that abide by that rule.

Interesting! I remember talking about this with @honnibal and how other libraries are doing similar things with _ attributes. So are you saying that the _ would be considered an invalid variable name? Take this proof of concept as an example – just tested it, and according to PEP8, this is valid Python:

class Doc(object):
    pass

class Underscore(object):
    pass

Doc._ = Underscore()
Doc._.foo = 'bar'

Or is there something I'm missing?

Also, it still allows for collisions amongst extensions. Why not just use a transparent .my_extension namespace?

You mean Doc.my_extension? As I mentioned above, we're worried about backwards incompatibility and maintainability here, e.g. spaCy claiming a user's custom namespace and breaking extensions, or large code bases becoming harder to read and navigate, since it's less obvious what's custom and what's built-in. The _ would act as a separator between the built-ins and custom extensions (which actually works pretty well, visually). The __setattr__ and __getattr__ approach will definitely come in handy for routing the attributes from the Doc to the Token and Span slices (which don't own any data themselves, so the attributes will have to be re-routed via the Doc to allow the smooth experience of writing to Token._).

Namespace collisions are a valid concern though – but also a problem that many other applications with a third-party extension ecosystem have solved before us. So I'm pretty confident we can find a good solution for this. For developers looking to publish spaCy extensions, the recommended best practices should also include an option to allow the user to overwrite the attributes the extension is setting (which seems pretty standard for plugins in general). This way, if two extensions happen to clash (or if the user disagrees with the developer's naming preferences 😉), it's easy to fix.

christian-storm commented 6 years ago

I think your model = weights (binary data) + pipeline + language data is a great starting point. Then recursively going into each and explaining how they're made and manipulated would be great. As you say, it is mostly a documentation and naming issue. Maybe there should be an advanced usage (what is happening under the hood) section for each component for those wanting to delve deeper/train models/create pipelines/etc.? This:

cls = util.get_lang_class(lang)
nlp = cls(pipeline=pipeline)
nlp.from_disk(model_data_path)

makes your model = weights + pipeline + language data so much clearer.

Having too many options to do the same thing can be confusing.

Interesting! I remember talking about this with @honnibal and how other libraries are doing similar things with _ attributes.

Do you know which ones? I'd be curious to have a look at them.

So are you saying that the _ would be considered an invalid variable name?

I wouldn't go so far as saying it's an invalid variable name. It's just that _ is a special character in Python and you are co-opting it for another use: defining a public namespace. I guess I don't see how _ is better than, say:

Doc.pub = PublicNamespace()  # or Doc.p = PublicNamespace()
Doc.pub.my_foo = 'bar'       # or Doc.p.my_foo = 'bar'

Not a huge deal, I realize, but it is nice to have things consistent. I have no doubt you guys will come up with a good solution.

Looking forward to seeing this in action!
