Closed: louni-g closed this issue 9 months ago
Hi, thank you for this detailed feedback!
Indeed, the `eds.negation` pipe (and any other pipe relying on the `EDSPhraseMatcher`) applies the same processing to the entries of its term lists as it does to documents. To do that, it filters the pipeline to keep only the pipes that affect token extensions, and the `lemmatizer` and `morphologizer` components declare such changes to tokens:
```
nlp.get_pipe_meta('morphologizer').assigns
# ['token.morph', 'token.pos']
nlp.get_pipe_meta('lemmatizer').assigns
# ['token.lemma']
```
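For illustration, this token-level filtering could be sketched as follows (a standalone sketch with mocked metadata dicts, not the actual spacy or edsnlp code):

```python
# Mocked pipe metadata, standing in for nlp.get_pipe_meta(name).assigns.
PIPE_ASSIGNS = {
    "morphologizer": ["token.morph", "token.pos"],
    "lemmatizer": ["token.lemma"],
    "transformer": [],  # declares no token-level assignment
    "ner": ["doc.ents", "token.ent_iob", "token.ent_type"],
}

def token_pipes(pipe_assigns):
    """Keep only the pipes that declare at least one token.* assignment."""
    return [
        name
        for name, assigns in pipe_assigns.items()
        if any(a.startswith("token.") for a in assigns)
    ]

print(token_pipes(PIPE_ASSIGNS))
```

Note that `ner` is kept by such a filter because of its `token.ent_iob` / `token.ent_type` assignments, which matches the behavior reported below.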
Ideally, we should not run any pipes in the `__init__()` method (e.g. instead of storing terms, storing the `.norm_`, `.text` extensions, ...). In the meantime, we could update the `EDSPhraseMatcher` (and its variants) to skip pipes that are clearly not required (as shown by their `assigns` attribute, e.g. `nlp.get_pipe_meta('morphologizer').assigns`) or pipes that are disabled.
@louni-g may I ask for what task you need a transformer in your pipeline? is it to use the pre-trained lemmatizer / morphologizer / ... pipes of spacy, or to train a new model, or something else ?
I trained a spacy-transformers NER model, and in my case I only have the following pipes: `["transformer", "ner"]`, and it's the `ner` one that ends up in the `token_pipelines`:
```
nlp.get_pipe_meta('ner').assigns
# ['doc.ents', 'token.ent_iob', 'token.ent_type']
```
so I think it would be totally ok to skip non-necessary pipes 👍
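A sketch of such a skip, combining the two criteria discussed above (disabled pipes and pipes whose assignments are not needed); the pipe metadata and the `needed` attribute set are mocked, hypothetical stand-ins for the real spacy/edsnlp objects:

```python
# Mocked pipe metadata: name -> (assigns, disabled). Stands in for
# nlp.get_pipe_meta(name).assigns and nlp.disabled.
PIPES = {
    "transformer": ([], False),
    "lemmatizer": (["token.lemma"], False),
    "ner": (["doc.ents", "token.ent_iob", "token.ent_type"], True),
}

def pipes_to_run(pipes, needed=frozenset({"token.lemma", "token.norm"})):
    """Keep only enabled pipes whose declared assignments are actually
    needed to preprocess the matcher's term lists (hypothetical logic)."""
    return [
        name
        for name, (assigns, disabled) in pipes.items()
        if not disabled and needed.intersection(assigns)
    ]

print(pipes_to_run(PIPES))
```

With this logic, a `["transformer", "ner"]` pipeline with `ner` assignments limited to entity attributes would yield no pipes to run at all for a matcher working on plain text or norms.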
Description

When loading a pipeline from disk, if the pipeline contains a spacy-transformers model and any edsnlp qualifiers, this error is encountered:
Full Traceback
```
File "/Users/Louise/Library/Application Support/JetBrains/PyCharm2023.2/scratches/scratch.py", line 8, in
```

The error occurs during the initialization of the qualifiers, where the `token_pipelines` are run in `EDSPhraseMatcher`'s `build_patterns`. I did a bit of digging, and it seems the error comes from the fact that the spacy-transformers pipelines are not fully initialized at this point, so running them raises an error. Possible fixes could be to skip the problematic pipes if they are not necessary to run, or to do this step once the whole pipeline has been completely initialized (not in the `__init__`).

How to reproduce the bug
Your Environment