Closed – @ines closed this issue 6 years ago.
This is incredible and will solve a lot of problems. A lot of great thinking and design decisions. A couple reactions and questions formed by thinking about how I'd apply this to some use cases.
As it stands one can only add to the pipe, e.g., there isn't `nlp.disable_pipe` or `nlp.replace_pipe`. How would one solve for the use case of replacing the tokenizer and still have the rest of the default spaCy pipeline intact? Like this:

```python
nlp = spacy.load('en')
nlp.replace_pipeline(['custom_tokenizer', 'tensorizer', 'tagger', 'parser', 'ner'])
```
In general I have to say the current state of this pipeline business is confusing. `spacy.load()` creates a pipeline. So does `Language()`. Then of course there is `Language().from_disk()`. Furthermore, it has always confused me why `Language` was given as the name for the pipeline. Pipelines IMHO should be at the spaCy level, not the `Language` level.

To me the lifecycle of a pipeline should be as simple as `nlp = spacy.load_pipeline()` and `nlp = spacy.deserialize_pipeline()` to load a working pipeline from disk or memory, `nlp.train_pipeline()` and `nlp.tune_pipeline()` for training and fine-tuning, and `nlp.save_pipeline()` and `nlp.serialize_pipeline()` for saving a pipeline to disk or memory, e.g., to store in a cache. I suppose you should have `nlp.replace_pipeline()` too, but that is essentially just `nlp = spacy.load_pipeline()` or some combination of `nlp.add_pipe()`, `nlp.replace_pipe()` and/or `nlp.disable_pipe()`, right?
> This system would also allow adding custom `Doc`, `Token` and `Span` methods, similar to the built-in `similarity()`.
Does that mean user_hooks and its ilk would (hopefully) be jettisoned?
I really like `._` for the fact that it lives at the same level as spaCy's variables. As you well know, it flies in the face of PEP 8, though. A bit of mental dissonance and cause for confusion for those of us who abide by that rule. Also, it still allows for collisions amongst extensions. Why not just use a transparent `.my_extension` namespace, similar to Python's name mangling with `__var` to prevent namespace collisions? It seems some clever use of `__setattr__` and `__getattr__` and call signatures could be used for the routing.
Thanks for your feedback!
> How would one solve for the use case of replacing the tokenizer and still have the rest of the default spaCy pipeline intact?
The tokenizer is a "special" pipeline component in the sense that it takes a different input – text. That's also the reason it's not part of the `pipeline` list. Conceptually, pipeline components should always be swappable (at least in theory), and since the tokenizer is more of a pre-processor that creates the first `Doc` object for the pipeline, it gets a bit of special treatment here. However, it's possible to simply overwrite it with a custom function:

```python
nlp.tokenizer = MyCustomTokenizer(nlp.vocab)
```
We do think that there's a point in keeping those things simple. If something is an object you can overwrite, you should be able to do so. Just like you'll still be able to append stuff to `nlp.pipeline` – although the additional methods like `add_pipe` are justified here, since they actually offer additional helpers like the `before`/`after` arguments, which would otherwise be pretty annoying.
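To make the tokenizer contract concrete: the replacement just needs to be a callable that takes raw text and returns a `Doc`-like object. The sketch below uses a stand-in `Doc` and a naive whitespace split so it runs without spaCy; with spaCy you'd construct `spacy.tokens.Doc(vocab, words=...)` instead. All names here are illustrative.

```python
# Stand-in for spacy.tokens.Doc, just enough to show the contract.
class FakeDoc(object):
    def __init__(self, vocab, words):
        self.vocab = vocab
        self.words = words

class MyCustomTokenizer(object):
    """A tokenizer is a callable: raw text in, Doc out."""
    def __init__(self, vocab):
        # the shared vocab would normally come from nlp.vocab
        self.vocab = vocab

    def __call__(self, text):
        # naive whitespace split stands in for real tokenization rules
        return FakeDoc(self.vocab, words=text.split())

tokenizer = MyCustomTokenizer(vocab='shared-vocab')
doc = tokenizer('Hello spaCy world')
print(doc.words)  # → ['Hello', 'spaCy', 'world']
```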
> As it stands one can only add to the pipe, e.g., there isn't `nlp.disable_pipe` or `nlp.replace_pipe`.
Good point – didn't add this in my proposal, but those methods should definitely exist. Similarly, there should probably be a `get_pipe` that lets you get a component by its name.
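A minimal sketch of what such a `get_pipe` helper could do. The assumption that the pipeline is stored as a list of `(name, component)` pairs is mine, not confirmed by the thread:

```python
def get_pipe(pipeline, name):
    """Return the component registered under `name`, or raise KeyError."""
    for pipe_name, component in pipeline:
        if pipe_name == name:
            return component
    raise KeyError("No component %r in pipeline" % name)

# hypothetical pipeline: names paired with component callables
pipeline = [('tensorizer', object()), ('tagger', object())]
print(get_pipe(pipeline, 'tagger') is pipeline[1][1])  # → True
```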
> In general I have to say the current state of this pipeline business is confusing. `spacy.load()` creates a pipeline. So does `Language()`. Then of course there is `Language().from_disk()`. Furthermore, it has always confused me why `Language` was given as the name for the pipeline. Pipelines IMHO should be at the spaCy level, not the `Language` level.
I hope we'll be able to make this less confusing in the new documentation! I think a lot of the design around this comes down to how the models work under the hood, and how we've been moving towards making them more transparent (since there are now many different models with different features and trade-offs instead of just one "the model").
It probably also doesn't help that the `nlp` object, i.e. an instance of `Language`, is sometimes referred to as "the processing pipeline", when in fact it's the container that holds the language data, access to the model's weights and a reference to a processing pipeline.

A model = weights (binary data) + pipeline + language data. The pipeline applied when you call `nlp`, i.e. `Language`, on a text often depends on the model, which is why model packages can now define a list of pipeline component names in their `meta.json`. `spacy.load()` puts this all together by returning an instance of `Language` that holds the language data, access to the model data and a reference to the pipeline to apply. So essentially, `spacy.load()` is a convenience wrapper for:
```python
cls = util.get_lang_class(lang)
nlp = cls(pipeline=pipeline)
nlp.from_disk(model_data_path)
```
There's actually very little going on at the `spacy` level except for convenience methods. The `Language` instance you assign to `nlp` and pass around the application is what matters and holds state.
> Does that mean `user_hooks` and its ilk would (hopefully) be jettisoned?
I'm personally not a huge fan of the `user_hooks`, but I also see @honnibal's point that it makes sense to let the user overwrite some built-in methods. The built-in implementation of the `similarity()` method is fairly arbitrary and, just like the vectors, it's one of those cases where it's pretty hard to decide on one ideal, general-purpose implementation. So this was the original motivation of the `user_hooks`. However, the hooks will follow the same API as pipeline components, so maybe there is a better way to solve this going forward. I wouldn't mind if users simply transitioned to setting those things via their own `._` methods instead (even if we keep support for the `user_hooks` mechanism).
> As you well know, it flies in the face of PEP though. A bit of mental dissonance and cause for confusion for those of us that abide by that rule.
Interesting! I remember talking about this with @honnibal and how other libraries are doing similar things with `_` attributes. So are you saying that the `_` would be considered an invalid variable name? Take this proof of concept as an example – just tested it, and according to PEP 8, this is valid Python:
```python
class Doc(object):
    pass

class Underscore(object):
    pass

Doc._ = Underscore()
Doc._.foo = 'bar'
```
Or is there something I'm missing?
> Also, it still allows for collisions amongst extensions. Why not just use a transparent `.my_extension` namespace?
You mean `Doc.my_extension`? As I mentioned above, we're worried about backwards incompatibility and maintainability here, e.g. spaCy claiming a user's custom namespace and breaking extensions, or large code bases becoming harder to read and navigate, since it's less obvious what's custom and what's built-in. The `_` would act as a separator between the built-ins and custom extensions (which actually works pretty well, visually). The `__setattr__` and `__getattr__` approach will definitely come in handy for routing the attributes from the `Doc` to the `Token` and `Span` slices (which don't own any data themselves, so the attributes will have to be re-routed via the `Doc` to allow the smooth experience of writing to `Token._`).
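The routing idea can be sketched in a few lines. This is not spaCy's actual implementation, just an illustration of the principle: a `Token`'s `._` reads and writes are redirected into storage owned by its parent `Doc`, keyed by the token's index.

```python
class Underscore(object):
    """Redirects attribute access to a shared store, keyed by (name, index)."""
    def __init__(self, store, index):
        # bypass our own __setattr__ while wiring up internals
        object.__setattr__(self, '_store', store)
        object.__setattr__(self, '_index', index)

    def __setattr__(self, name, value):
        self._store[(name, self._index)] = value

    def __getattr__(self, name):
        # only called for attributes not found normally
        return self._store[(name, self._index)]

doc_store = {}                       # the data lives on the "Doc"
token_underscore = Underscore(doc_store, 3)
token_underscore.sent_score = 0.9    # write via the Token view...
print(doc_store[('sent_score', 3)])  # → 0.9  ...lands in the Doc's store
```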
Namespace collisions are a valid concern though – but also a problem that many other applications with a third-party extension ecosystem have solved before us. So I'm pretty confident we can find a good solution for this. For developers looking to publish spaCy extensions, the recommended best practices should also include an option to allow the user to overwrite the attributes the extension is setting (which seems pretty standard for plugins in general). This way, if two extensions happen to clash (or if the user disagrees with the developer's naming preferences 😉), it's easy to fix.
I think your model = weights (binary data) + pipeline + language data is a great starting point. Then recursively going into each and explaining how they're made and manipulated would be great. As you say, it is mostly a documentation and naming issue. Maybe there should be an advanced usage (what is happening under the hood) section for each component for those wanting to delve deeper/train models/create pipelines/etc.?
This:

```python
cls = util.get_lang_class(lang)
nlp = cls(pipeline=pipeline)
nlp.from_disk(model_data_path)
```

makes your model = weights + pipeline + language data so much clearer.
Having too many options to do the same thing can be confusing.
> Interesting! I remember talking about this with @honnibal and how other libraries are doing similar things with `_` attributes.
Do you know which ones? I'd be curious to have a look at them.
> So are you saying that the `_` would be considered an invalid variable name?
I wouldn't go so far as saying that `_` is an invalid variable name. It is just a special character in Python, and you are co-opting it for another use: defining a public namespace. I guess I don't see how `_` is better than, say,

```python
Doc.pub = PublicNamespace()  # or Doc.p = PublicNamespace()
Doc.pub.my_foo = 'bar'       # or Doc.p.my_foo = 'bar'
```
Not a huge deal I realize but it is nice to have things consistent. I have no doubt you guys will come up with a good solution.
Looking forward to seeing this in action!
Related issues: #1085, #1105, #860 CC: @honnibal, @Liebeck, @christian-storm
Motivation
Custom processing pipelines are a very powerful feature of spaCy that will be able to solve many problems people are currently having when making NLP work for their specific use case. So for spaCy v2.0, we've been working on improving the processing pipelines architecture and extensibility. Fundamentally, a pipeline is a list of functions called on a `Doc` in order. The pipeline can be set by a model, and modified by the user. A pipeline component can be a complex class that holds state, or a very simple Python function that adds something to a `Doc` and returns it. However, even with the current state of proposed improvements, the pipelines still aren't perfect and as user-friendly as they should be.

If it's easier to write custom data to the `Doc`, `Token` and `Span`, applications using spaCy will be able to take full advantage of the built-in data structures and the benefits of `Doc` objects as the single source of truth containing all information. Instead of mixing `Doc` objects, arrays, plain text and other structures, applications could only pass around `Doc` objects and read from and write to them whenever necessary.

Having a straightforward API for custom extensions and a clearly defined input/output (`Doc`/`Doc`) also helps making larger code bases more maintainable, and allows developers to share their extensions with others and test them reliably. This is relevant for teams working with spaCy, but also for developers looking to publish their own packages, extensions and plugins.

The spaCy philosophy has always been to focus on providing one, best-possible implementation, instead of adopting a "broad church" approach, which makes a lot of sense for research libraries, but can be potentially dangerous for libraries aimed at production use. Going forward, I believe the best future-proof strategy is to direct our efforts at making the processing pipeline more transparent and extensible, and encouraging a community ecosystem of spaCy components to cover any potential use case – no matter how specific. Components could range from simple extensions adding fairly trivial attributes for convenience, to complex models making use of external libraries such as PyTorch, scikit-learn and TensorFlow.
There are many components users may want, and we'd love to be able to offer more built-in pipeline components shipped with spaCy (e.g. SBD, SRL, coref, sentiment). But there's also a clear need for making spaCy extensible for specific use cases, making it interoperate better with other libraries, and putting all of it together to update and train statistical models (the other big issue we're tackling with v2.0).
TL;DR

- `Doc._`, `Token._` and `Span._` attributes users can write to, choosing any custom namespace.
- An `Underscore` class will wire it all together and resolve the custom properties for tokens and spans, which are only views of the `Doc`.
- A `Language.add_pipe` method to add pipeline components, with options to specify the pipeline IDs to add the component before/after, and a `Language.replace_pipeline` method to replace the entire pipeline.
- A `Language.pipe_names` property that returns a list of the pipeline IDs (e.g. `['tensorizer', 'ner']`) as a human-readable version of `Language.pipeline`.
- A `Pipe` base class used by spaCy for its built-in components like the tagger, parser and entity recognizer.

Why `._`?

Letting the user write to a `._` attribute instead of to the `Doc` directly keeps a clearer separation and makes it easier to ensure backwards compatibility. For example, if you've implemented your own `.coref` property and spaCy claims it one day, it'll break your code. Similarly, as we have more and more production users with sizable code bases, this solution will make it much easier to tell what's built-in and what's custom. Just by looking at the code, you'll immediately know that `doc.sentiment` is spaCy, and `doc._.sent_score` isn't.

`Doc._` is shorter and more distinct than `Doc.user_data`, and for the lack of better options in Python, the `_` seems like the best choice. (It's also kinda cute... `doc._.doc`... once you see the face, you can't unsee it. Just like I'll never be able to read `doc.cats` as "doc dot categories" again 😺)

Custom pipeline components
Pipeline components can write to a `Doc`, `Span` or `Token`'s `_` attribute, which is resolved internally via an `Underscore` class. In the case of `Span` and `Token`, this means resolving it relative to the respective indices, as they are only views of a `Doc`. A pipeline component can hold any state, take the shared `Vocab` if needed, and implement its own getters and setters.

A component added to the pipeline needs to be a callable that takes a `Doc`, modifies it and returns it. Here's a simple example of a component wrapper that takes arbitrary settings and assigns "something" to a `Doc` and `Token`. The custom component could then be initialised and used like this:
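A minimal sketch of that contract, using a stand-in `Doc` with a writable `._` slot so it runs without spaCy. The class and attribute names (`MyComponent`, `my_label`) are illustrative, not from the proposal:

```python
# Stand-ins for spaCy's Doc and its ._ extension slot.
class FakeUnderscore(object):
    pass

class FakeDoc(object):
    def __init__(self, text):
        self.text = text
        self._ = FakeUnderscore()

class MyComponent(object):
    """A pipeline component: a callable that takes a Doc, modifies it,
    and returns it. Arbitrary settings are held as component state."""
    name = 'my_component'

    def __init__(self, label='EXAMPLE'):
        self.label = label

    def __call__(self, doc):
        # assign "something" to the Doc via its ._ attribute
        doc._.my_label = self.label
        return doc

component = MyComponent(label='CUSTOM')
doc = component(FakeDoc('hello world'))
print(doc._.my_label)  # → 'CUSTOM'
```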
`add_pipe()` would offer a more convenient way of adding to the pipeline than `pipeline.append()` or overwriting the pipeline, which easily gets messy, as you have to know the names and order of components, or at least the index at which to insert the new component. The `before` and `after` keyword arguments can specify one or more IDs to insert the component before/after (which will be resolved accordingly, and raise an error if the positioning is impossible). When the pipeline is applied, the custom attribute is available via `._`.

This system would also allow adding custom `Doc`, `Token` and `Span` methods, similar to the built-in `similarity()`.

A model can either require the component package as a dependency, or ship the component code as part of the model package. It can then be added to the pipeline in the model's `__init__.py`.

Alternatively, a trainable and fully serializable custom pipeline component could also be implemented via the `Pipe` base class, which is used for spaCy's built-in pipeline components like the tagger, parser and entity recognizer in v2.0. Going forward, we can even take this architecture one step further and allow other applications to register spaCy pipeline components via entry points, which would make them available via their name.
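The before/after insertion logic that `add_pipe` describes can be sketched as follows. The representation of the pipeline as a list of `(name, component)` pairs is an assumption for illustration:

```python
def add_pipe(pipeline, component, name, before=None, after=None):
    """Insert (name, component) before/after a named component,
    or append at the end if no position is given."""
    names = [n for n, _ in pipeline]
    if before is not None:
        index = names.index(before)       # raises ValueError if impossible
    elif after is not None:
        index = names.index(after) + 1    # raises ValueError if impossible
    else:
        index = len(pipeline)
    pipeline.insert(index, (name, component))

pipeline = [('tensorizer', object()), ('tagger', object()), ('ner', object())]
add_pipe(pipeline, object(), 'my_component', before='ner')
print([n for n, _ in pipeline])
# → ['tensorizer', 'tagger', 'my_component', 'ner']
```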
New classes, methods and properties

`Language.pipe_names` (property)

Returns a list of pipeline component IDs in order. Useful to check the current pipeline, and determine where to insert custom components.

`Language.add_pipe` (method)

Add a component to the pipeline. Arguments: `component`, `name` (defaults to `component.name`), `before` and `after`.

`Language.replace_pipeline` (method)

Replace the pipeline. Arguments: `pipeline`.

`Underscore` (class)

Resolves `Doc._`, `Span._` and `Token._` set by the user.

The pipeline component ecosystem
The processing pipeline outlined in this proposal is a good fit for a component-based ecosystem, as pipeline components would have the following features: a lifecycle, an isolated scope and a standardised API.
Component-based ecosystems can be very powerful in driving forward community contributions, while at the same time, keeping the core library focussed and compact. We're obviously happy to integrate third-party components into the core if they're a good fit, but we also want developers to be able to take ownership of their extensions, write spaCy wrappers for their libraries and implement any logic they need quickly, without having to worry about the grand scheme of things.
If you're the maintainer of a library and want to integrate it with spaCy, you'd be able to offer a simple pipeline component your users could plug in and use. Your installation instructions would be as simple as: install the package, initialise it with your settings and add it to your pipeline using `nlp.add_pipe()`. Your extension can claim its own `._` namespace on the `Doc`, `Token` and `Span`.

Production users with large code bases would be able to manage their spaCy extensions and utilities as packages that can be developed and integrated into CI workflows independently.
Aside from the obvious use case of implementing models and missing text processing features, there are many other, creative ways in which pipeline component extensions can be utilised – for example:
In terms of the community strategy around this, a possible approach could be:
I don't think a `spacy-contrib` or `spacy-extensions` package like some other libraries have would be a good solution for us. Extensions are very specific, and users shouldn't have to install a bunch of stuff they don't need just to use one particular component. Versioning packages like this is also a nightmare. Similarly, I don't think we should make people submit them to an "official" repository – if someone made a spaCy extension, they should be able to showcase their work on their own GitHub profiles.