explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io

spacy train from prodigy data-to-spacy config with en_core_web_trf yields ValueError: Cannot deserialize model: mismatched structure #11823

Closed cbjrobertson closed 1 year ago

cbjrobertson commented 1 year ago

I'm training a spaCy multi-label text classification model using en_core_web_trf from spacy-transformers. The data and config file I use are generated through a prodigy data-to-spacy call. The issue is that when I try to reload the trained model using spacy.load("path/to/mod"), it returns: ValueError: Cannot deserialize model: mismatched structure. Based on the prodigy forums, this ought to have been fixed in spacy-transformers v1.0.6, but I'm running spacy-transformers==1.1.8, so I believe there's still a bug somewhere, likely in spacy-transformers (see #8566). I can't share the data, but I'll do my best to make the issue reproducible.
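
(To double-check that those versions are what's actually active in the environment, here is a quick sanity check:)

import spacy
import pkg_resources

print(spacy.__version__)  # 3.4.3 in my environment
print(pkg_resources.get_distribution("spacy-transformers").version)  # 1.1.8 in my environment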

How to reproduce the behaviour (from a notebook):

import spacy
from spacy.cli.train import train

out_path = "./model/out/path"

!prodigy data-to-spacy $out_path --textcat-multilabel dataset_name --base-model en_core_web_trf --eval-split 0.2 

train(f"{out_path}/config.cfg", 
      out_path,
      use_gpu = 1,
      overrides={"paths.train" : f"{out_path}train.spacy", 
                 "paths.dev" : f"{out_path}dev.spacy"
                }
     )

nlp = spacy.load(out_path)
>>> ValueError: Cannot deserialize model: mismatched structure

On the basis of this advice, this error can be worked around in either of two ways:

  1. Change the call to spacy.load(out_path) to spacy.load(out_path, disable=["tagger","parser","attribute_ruler","lemmatizer","ner"])
  2. Edit line 13 of out_path/config.cfg from pipeline = ["transformer","tagger","parser","attribute_ruler","lemmatizer","ner","textcat_multilabel"] to pipeline = ["transformer","textcat_multilabel"] (a programmatic sketch of this follows below)
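
For reference, a minimal, untested sketch of doing option 2 programmatically via spacy.util.load_config (assuming out_path is the same output directory as above):

from spacy.util import load_config

# Rewrite the generated config so that only the components that were
# actually trained remain in the pipeline.
cfg = load_config(f"{out_path}/config.cfg")
cfg["nlp"]["pipeline"] = ["transformer", "textcat_multilabel"]
cfg.to_disk(f"{out_path}/config.cfg")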

However, neither of these is ideal. For instance, getting these models to work with spacy-report requires editing that package's source code. It seems there's still a bug relating to the frozen components in the call to spacy train!

Here is an example observation from dataset_name:

{'_input_hash': -849869852,
 '_task_hash': -473741752,
 'answer': 'reject',
 'label': 'MY_LABEL',
 'text': "Foo to the bar.'}

Your Environment

================= Installed pipeline packages (spaCy v3.4.3) =================
ℹ spaCy installation:
/home/coler/anaconda3/envs/prodigy/lib/python3.9/site-packages/spacy

NAME              SPACY            VERSION                            
en_core_web_trf   >=3.4.1,<3.5.0   3.4.1   ✔
en_core_web_lg    >=3.4.0,<3.5.0   3.4.1   ✔
[paths]
train = "./new_mods/corpus/train.spacy"
dev = "./new_mods/corpus/dev.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","tagger","parser","attribute_ruler","lemmatizer","ner","textcat_multilabel"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 64
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers":"spacy.attribute_ruler_scorer.v1"}
validate = false

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.ner.model.tok2vec.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.ner.model.tok2vec.grad_scaler_config]

[components.ner.model.tok2vec.tokenizer_config]
use_fast = true

[components.ner.model.tok2vec.transformer_config]

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.parser.model.tok2vec.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.parser.model.tok2vec.grad_scaler_config]

[components.parser.model.tok2vec.tokenizer_config]
use_fast = true

[components.parser.model.tok2vec.transformer_config]

[components.tagger]
factory = "tagger"
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.tagger.model.tok2vec]
@architectures = "spacy-transformers.Tok2VecTransformer.v3"
name = "roberta-base"
mixed_precision = false
pooling = {"@layers":"reduce_mean.v1"}
grad_factor = 1.0

[components.tagger.model.tok2vec.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.tagger.model.tok2vec.grad_scaler_config]

[components.tagger.model.tok2vec.tokenizer_config]
use_fast = true

[components.tagger.model.tok2vec.transformer_config]

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 3
patience = 5000
max_epochs = 0
max_steps = 20000
eval_frequency = 1000
frozen_components = ["tagger","parser","attribute_ruler","lemmatizer","ner"]
before_to_disk = null
annotating_components = []

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
get_length = null
size = 2000
buffer = 256

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
tag_acc = null
dep_uas = null
dep_las = null
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = null
lemma_acc = null
ents_f = null
ents_p = null
ents_r = null
ents_per_type = null
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null
speed = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.components.ner]

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "new_mods/corpus/labels/ner.json"
require = false

[initialize.components.parser]

[initialize.components.parser.labels]
@readers = "spacy.read_labels.v1"
path = "new_mods/corpus/labels/parser.json"
require = false

[initialize.components.tagger]

[initialize.components.tagger.labels]
@readers = "spacy.read_labels.v1"
path = "new_mods/corpus/labels/tagger.json"
require = false

[initialize.components.textcat_multilabel]

[initialize.components.textcat_multilabel.labels]
@readers = "spacy.read_labels.v1"
path = "new_mods/corpus/labels/textcat_multilabel.json"
require = false

[initialize.tokenizer]

adrianeboyd commented 1 year ago

Let me first back up a step: your goal is to have a final pipeline with everything from en_core_web_trf + your new textcat component (non-transformer-based) trained on your data from prodigy?

If that's the case, then you can train the textcat model separately and "assemble" the final pipeline as the last step. It would look like this:

prodigy data-to-spacy out/ --textcat-multilabel dataset_name --eval-split 0.2
spacy train out/config.cfg --paths.train out/train.spacy --paths.dev out/dev.spacy -o training/

And then assemble. You could write a config to do this with spacy assemble, but it's easier to do it programmatically:

import spacy
nlp = spacy.load("en_core_web_trf")
tcm_nlp = spacy.load("training/model-best")
tcm_nlp.replace_listeners("tok2vec", "textcat_multilabel", ["model.tok2vec"])
nlp.add_pipe("textcat_multilabel", source=tcm_nlp)
nlp.to_disk("/path/to/my_combined_pipeline")
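
Once assembled, a quick smoke test could look like this (the text is a placeholder; doc.cats will contain your own labels, e.g. MY_LABEL from the example above):

nlp = spacy.load("/path/to/my_combined_pipeline")
doc = nlp("Foo to the bar.")
print(doc.cats)  # textcat_multilabel scores, e.g. {"MY_LABEL": 0.12}
print(doc.ents)  # the components sourced from en_core_web_trf still run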

In addition, the prodigy config defaults you get with data-to-spacy use the faster BOW-only textcat architecture, whereas you may see better performance with the ensemble classifier. The config could look like this:

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

(Ideally you'd be able to generate this with spacy init config -p textcat_multilabel, but the default options only allow for BOW (-o efficiency) or ensemble+static vectors (-o accuracy). The version above is modified by hand from -o accuracy to disable static vectors, which you don't have in the en_core_web_trf pipeline. If you used en_core_web_lg instead, you could keep the static vectors enabled.)
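
(Concretely, that would be something like:

python -m spacy init config base_config.cfg --lang en --pipeline textcat_multilabel --optimize accuracy

followed by setting include_static_vectors = false by hand in [components.tok2vec.model.embed], as in the version above.)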

cbjrobertson commented 1 year ago

No, that's not what I want to do. What I was trying to do in the above code was simply reproduce the prodigy training procedure using spaCy. I ran into the bug above and wanted to flag it.

Perhaps this conversation would be better suited to the spaCy and/or prodigy forums, but what I intend to do is train a textcat ensemble model that combines TextCatBOW with a transformer-based embedding layer with non-static vectors, i.e. I want to fine-tune the transformer embeddings, if that's possible. Your example config.cfg does not use transformer-based embeddings, if I'm reading it correctly. Any advice?
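
For concreteness, I'd imagine the config would need something roughly like the following, with the ensemble's tok2vec slot listening to the shared transformer component (untested, and I'm not sure the types line up):

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "transformer"

Is that on the right track?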

As an aside, given what you've mentioned above, what is the difference between calls to prodigy train textcat when --base-model is set to en_core_web_trf as compared to en_core_web_lg? From your explanation above, it sounds like the classification layer is identical (i.e. spacy.TextCatBOW.v2). Is that really true?

Additionally, the code you suggest doesn't work. It fails with:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [13], in <cell line: 5>()
      2 nlp = spacy.load("en_core_web_trf")
      3 tcm_nlp = spacy.load(f"./{EXP_NAME}/model/model-best")
----> 5 tcm_nlp.replace_listeners("tok2vec", "textcat_multilabel", ["model.tok2vec"])
      6 nlp.add_pipe("textcat_multilabel", source=tcm_nlp)
      7 nlp.to_disk(f"./{EXP_NAME}/test.cfg")

File ~/anaconda3/envs/prodigy/lib/python3.9/site-packages/spacy/language.py:1969, in Language.replace_listeners(self, tok2vec_name, pipe_name, listeners)
   1962 if tok2vec_name not in self.pipe_names:
   1963     err = Errors.E889.format(
   1964         tok2vec=tok2vec_name,
   1965         name=pipe_name,
   1966         unknown=tok2vec_name,
   1967         opts=", ".join(self.pipe_names),
   1968     )
-> 1969     raise ValueError(err)
   1970 if pipe_name not in self.pipe_names:
   1971     err = Errors.E889.format(
   1972         tok2vec=tok2vec_name,
   1973         name=pipe_name,
   1974         unknown=pipe_name,
   1975         opts=", ".join(self.pipe_names),
   1976     )

ValueError: [E889] Can't replace 'tok2vec' listeners of component 'textcat_multilabel' because 'tok2vec' is not in the pipeline. Available components: textcat_multilabel. If you didn't call nlp.replace_listeners manually, this is likely a bug in spaCy.

adrianeboyd commented 1 year ago

Let me convert this to a discussion...