explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Running pipeline after add_pipe of built-in parser fails while disabling the component works okay #2707

Closed · christian-storm closed 5 years ago

christian-storm commented 5 years ago

How to reproduce the behaviour

Running this:

import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)

with nlp.disable_pipes('parser'):
    print(nlp.pipe_names)
    doc = nlp('this is test.')

nlp.remove_pipe('parser')
print(nlp.pipe_names)
doc = nlp('this is test.')

nlp.add_pipe(nlp.create_pipe('parser'), last=True)
print(nlp.pipe_names)
doc = nlp('this is test.')  # FAILS here

gives

['tagger', 'parser', 'ner']
['tagger', 'ner']
['tagger', 'ner']
['tagger', 'ner', 'parser']  
File "/Users/storm/anaconda/envs/py37/lib/python3.7/site-packages/spacy/language.py", line 352, in __call__
    doc = proc(doc)
  File "nn_parser.pyx", line 338, in spacy.syntax.nn_parser.Parser.__call__
  File "nn_parser.pyx", line 401, in spacy.syntax.nn_parser.Parser.parse_batch
  File "nn_parser.pyx", line 728, in spacy.syntax.nn_parser.Parser.get_batch_model
TypeError: 'bool' object is not iterable


ines commented 5 years ago

Thanks for the detailed report! 👍

> TypeError: 'bool' object is not iterable

I'm pretty sure this is already fixed in v2.1.x (spacy-nightly). It looks like the problem here is that the component's Model isn't initialized by default – only if you're loading in data via from_disk or from_bytes, or if you're calling begin_training, which will initialize the weights.

christian-storm commented 5 years ago

Drat... I wish I could report that that's the case.

I installed spacy-nightly with the new models[1] and it is still happening, with a similar error raised from a different method this time.

  File "nn_parser.pyx", line 188, in spacy.syntax.nn_parser.Parser.__call__
  File "nn_parser.pyx", line 223, in spacy.syntax.nn_parser.Parser.predict
  File "nn_parser.pyx", line 231, in spacy.syntax.nn_parser.Parser.greedy_parse
TypeError: 'bool' object is not callable

[1] I even did a clean uninstall/install just to be certain.

ines commented 5 years ago

Ah, damn – thanks for testing. In that case, this is definitely something we want to and will fix in spaCy v2.1.0!

christian-storm commented 5 years ago

Looks like parser isn't the only crack in the sidewalk.

import spacy
import traceback
import sys

def run_nlp(nlp, pname):
    txt = "This is a text about a tree kangaroo."
    try:
        d = nlp(txt)
    except (AttributeError, TypeError) as err:
        exc_type, exc_value, exc_traceback = sys.exc_info()
        print("Removing/adding back {} led to exception '{}'".format(pname, err))
        print("".join(traceback.format_tb(exc_traceback)[-2:]))

def main():
    for pname in ['tagger', 'parser', 'ner']:
        nlp = spacy.load('en_core_web_sm')
        run_nlp(nlp, pname)
        nlp.remove_pipe(pname)
        run_nlp(nlp, pname)
        nlp.add_pipe(nlp.create_pipe(pname))
        run_nlp(nlp, pname)

if __name__ == '__main__':
    main()

leads to

Removing/adding back tagger led to exception ''bool' object has no attribute 'tok2vec''
  File "pipeline.pyx", line 599, in spacy.pipeline.Tagger.__call__
  File "pipeline.pyx", line 617, in spacy.pipeline.Tagger.predict

Removing/adding back parser led to exception ''bool' object is not callable'
  File "nn_parser.pyx", line 223, in spacy.syntax.nn_parser.Parser.predict
  File "nn_parser.pyx", line 231, in spacy.syntax.nn_parser.Parser.greedy_parse

Removing/adding back ner led to exception ''bool' object is not callable'
  File "nn_parser.pyx", line 223, in spacy.syntax.nn_parser.Parser.predict
  File "nn_parser.pyx", line 231, in spacy.syntax.nn_parser.Parser.greedy_parse

honnibal commented 5 years ago

@christian-storm Sorry for the delay getting back to you on this. I was confused about this issue the first time I read it.

I actually think there's no problem here, except that the models should probably raise better errors when this occurs. After nlp.create_pipe(), the component is initialised without allocating a model. Model allocation is skipped because in the common case we'll shortly load a model, so we don't call into the component's .Model() classmethod by default. We create the model either during begin_training(), or on from_disk(), or on from_bytes().

So, what you're seeing here is that after add_pipe(), you're adding a component without an allocated model. I do think this should be an error: if we had allocated a model by default, you'd instead be seeing predictions from random weights, which to me is much worse behaviour.

christian-storm commented 5 years ago

@honnibal Thanks for responding to this with your thoughts. I have a different viewpoint that I'd like to offer you. I think some context on how spaCy is being used will help illustrate why you might reconsider your stance on behalf of us 'power users'.

I've built a configurable processing pipeline that encapsulates spaCy's pipeline. It adds static configurability in YAML, on-the-fly (per-task, per-module) configurability, an NLP singleton class, and lazy model (re)loading, giving us reproducible experiments, extensible/modular code, and adherence to the performance axiom of 'only load and run what you need.' For instance, I've built a catalog of sentence segmenters, tokenizers, parsers, etc. using third-party libraries (Punkt, CoreNLP, ...) and our own code. It has afforded us quick prototyping of new ideas, rapid incorporation of third-party libraries, and an enhanced ability to reproduce outside research/results on our own data stores.

From the documentation, I thought it was possible to start with either a blank model or only the spaCy components required by the calling function that triggers the initialization of the pipeline, e.g., nlp = spacy.load('en', disable=['parser', 'tagger']). Further down the line, components can be added/removed/swapped out/have their kwargs changed to match the requirements of the task at hand, e.g., nlp.add_pipe(nlp.create_pipe('parser')). This is what breaks.

For my components that require a model, the model is loaded in __init__ (i.e., the .create_pipe() call), and I take it a step further by allowing a different model to be loaded in situ if the calling kwargs dictate it. At least the create_pipe() part follows the same recipe that spaCy Universe modules are expected to use, e.g., spacy-kenlm. I've found this encapsulation very flexible and extensible while maintaining a fixed API/development abstraction, just like your pipeline API, and it has been a huge productivity booster. The only way to do this with spaCy's built-ins is to load the full model and then call .remove_pipe() on the unwanted components. Not the end of the world, I know, but it breaks the abstraction laid out in the documentation. If my argument doesn't sway you, you should add a big caveat to load() stating that some disabled components can't be added back later; you have to reload the model from scratch.

> After nlp.create_pipe(), the component is initialised without allocating a model. Model allocation is avoided because in the common case, we'll shortly load a model, so we don't call into the component's .Model() classmethod by default. We create the model either during begin_training(), or on from_disk(), or on from_bytes().

I believe the common use case you're referring to is the scripting folks out there who just .load() and run from there. However, isn't your intent to also support those of us in industry with larger architectures, where this configurability is a top feature of spaCy 2? After our (my) long GitHub musings on the subject, I decided to use spaCy as the platform to build this. Except for this snag, it has been working out beautifully, and I am indebted to Ines, yourself, and the community at large for a beautifully designed NLP system.

Are you opposed to having the spaCy components call their .Model() method, or, as a second-rate solution, having load() load everything and then call remove_pipe() on the disabled components? For my components, I use a singleton-type approach where the model is cached in a private variable and is only loaded if it hasn't been loaded yet or the pipe kwargs have changed.

Thanks again for listening and hearing me out.

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.