Closed cbjrobertson closed 1 year ago
Let me first back up a step: your goal is to have a final pipeline with everything from en_core_web_trf
+ your new textcat component (non-transformer-based) trained on your data from prodigy?
If that's the case, then you can train the textcat model separately and "assemble" the final pipeline as the last step. It would look like this:
prodigy data-to-spacy out/ --textcat-multilabel dataset_name --eval-split 0.2
spacy train out/config.cfg --paths.train out/train.spacy --paths.dev out/dev.spacy -o training/
And then assemble. You can write a config to do this with spacy assemble, but it's easier to do it programmatically:
import spacy

# Load the full pretrained pipeline and the separately trained textcat pipeline
nlp = spacy.load("en_core_web_trf")
tcm_nlp = spacy.load("training/model-best")

# Make the textcat's tok2vec self-contained so the component can be moved between pipelines
tcm_nlp.replace_listeners("tok2vec", "textcat_multilabel", ["model.tok2vec"])

# Source the textcat component into the trf pipeline and save the result
nlp.add_pipe("textcat_multilabel", source=tcm_nlp)
nlp.to_disk("/path/to/my_combined_pipeline")
In addition, the prodigy config defaults that you get with data-to-spacy are for the faster BOW-only textcat architecture, whereas you may see better performance with the ensemble classifier. The config could look like this:
[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5
[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null
[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null
[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3
(Ideally you'd be able to generate this with spacy init config -p textcat_multilabel, but the default options only allow for BOW (-o efficiency) or ensemble+static vectors (-o accuracy). The version above is modified by hand from -o accuracy to disable static vectors, which you don't have in the en_core_web_trf pipeline. If you used en_core_web_lg instead, you could keep the static vectors enabled.)
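As a side note on the threshold = 0.5 setting above: in a multilabel textcat the scores in doc.cats are independent per-label probabilities, and the threshold decides which labels count as predicted. A minimal pure-Python sketch of that logic (the function name is mine, not spaCy API):

```python
def labels_above_threshold(cats, threshold=0.5):
    """Return the labels whose scores clear the threshold.

    `cats` mirrors the shape of spaCy's `doc.cats`: a mapping of
    label -> score, where the scores are independent (non-exclusive).
    """
    return sorted(label for label, score in cats.items() if score >= threshold)


print(labels_above_threshold({"SPORTS": 0.91, "POLITICS": 0.12, "TECH": 0.55}))
# → ['SPORTS', 'TECH']
```

Because the classes are non-exclusive, any number of labels (including none) can clear the threshold for a given document.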
No, that's not what I want to do. What I was trying to do in the above code was simply reproduce the prodigy training procedure using spaCy. I ran into the bug above and wanted to flag it.
Perhaps this conversation would be better suited to the spaCy and/or prodigy forums, but what I intend to do is train a textcat ensemble model which combines TextCatBOW with a transformer-based embedding layer with non-static vectors, i.e. I want to fine-tune the transformer embeddings, if that's possible. Your example config.cfg does not use transformer-based embeddings, if I'm reading it correctly. Any advice?
As an aside, given what you've mentioned above, what is the difference between calls to prodigy train textcat when --base-model is set to en_core_web_trf as compared to en_core_web_lg? From your explanation above, it sounds like the classification layer is identical (i.e. spacy.TextCatBOW.v2). Is that really true?
Additionally, the code you suggest doesn't work. It fails with:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [13], in <cell line: 5>()
2 nlp = spacy.load("en_core_web_trf")
3 tcm_nlp = spacy.load(f"./{EXP_NAME}/model/model-best")
----> 5 tcm_nlp.replace_listeners("tok2vec", "textcat_multilabel", ["model.tok2vec"])
6 nlp.add_pipe("textcat_multilabel", source=tcm_nlp)
7 nlp.to_disk(f"./{EXP_NAME}/test.cfg")
File ~/anaconda3/envs/prodigy/lib/python3.9/site-packages/spacy/language.py:1969, in Language.replace_listeners(self, tok2vec_name, pipe_name, listeners)
1962 if tok2vec_name not in self.pipe_names:
1963 err = Errors.E889.format(
1964 tok2vec=tok2vec_name,
1965 name=pipe_name,
1966 unknown=tok2vec_name,
1967 opts=", ".join(self.pipe_names),
1968 )
-> 1969 raise ValueError(err)
1970 if pipe_name not in self.pipe_names:
1971 err = Errors.E889.format(
1972 tok2vec=tok2vec_name,
1973 name=pipe_name,
1974 unknown=pipe_name,
1975 opts=", ".join(self.pipe_names),
1976 )
ValueError: [E889] Can't replace 'tok2vec' listeners of component 'textcat_multilabel' because 'tok2vec' is not in the pipeline. Available components: textcat_multilabel. If you didn't call nlp.replace_listeners manually, this is likely a bug in spaCy.
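For what it's worth, the failure is consistent with E889's own message: replace_listeners requires a standalone tok2vec (or transformer) component in the loaded pipeline, and a textcat-only pipeline trained from the default data-to-spacy config doesn't contain one. A trivial pre-check along these lines (plain Python; the helper name is mine) makes the condition explicit:

```python
def can_replace_listeners(pipe_names, tok2vec_name="tok2vec"):
    # spaCy raises E889 when `tok2vec_name` is missing from nlp.pipe_names,
    # so check membership before calling nlp.replace_listeners().
    return tok2vec_name in pipe_names


# The trained pipeline in the traceback only exposes textcat_multilabel:
print(can_replace_listeners(["textcat_multilabel"]))             # → False
print(can_replace_listeners(["tok2vec", "textcat_multilabel"]))  # → True
```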
Let me convert this to a discussion...
I'm training a spaCy multi-label text classification model using en_core_web_trf from spaCy transformers. The data and config file I use are generated through a prodigy data-to-spacy call. The issue is that when I try to reload the model using spacy.load("path/to/mod"), it returns: ValueError: Cannot deserialize model: mismatched structure. Based on the prodigy forums, this ought to have been fixed with spacy-transformers version 1.0.6, but I'm running spacy-transformers==1.1.8, so I believe there's still a bug somewhere, likely in spacy-transformers; see #8566. I can't share the data, but I'll do my best to make the issue reproducible.

How to reproduce the behaviour (from a notebook):
On the basis of this advice, this error can be worked around in either of two ways:

1. Changing spacy.load(out_path) to spacy.load(out_path, disable="tagger,parser,attribute_ruler,lemmatizer,ner")
2. Editing out_path/config.cfg, changing pipeline = ["transformer","tagger","parser","attribute_ruler","lemmatizer","ner","textcat_multilabel"] to pipeline = ["transformer","textcat_multilabel"]
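The second workaround can also be scripted. spaCy's config.cfg is INI-like but is actually parsed by thinc's richer config system, so hand-editing is the safest route; under that caveat, a rough stdlib sketch could look like this. Note optionxform and interpolation=None, which keep case-sensitive keys like nO and ${...} references intact:

```python
from configparser import ConfigParser


def strip_frozen_components(cfg_path):
    """Rewrite [nlp] pipeline to keep only transformer + textcat_multilabel.

    Rough sketch only: configparser merely approximates thinc's config
    format, so verify the result afterwards with `spacy debug config`.
    """
    parser = ConfigParser(interpolation=None)  # leave ${...} references alone
    parser.optionxform = str  # preserve case-sensitive keys such as nO
    parser.read(cfg_path)
    parser.set("nlp", "pipeline", '["transformer","textcat_multilabel"]')
    with open(cfg_path, "w") as fh:
        parser.write(fh)
```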
However, neither of these is ideal. For instance, getting these models to work with spacy-report requires editing the source code of that package. It seems there's still a bug relating to the frozen components in the call to spacy train!

Here is an example observation from dataset_name:

Your Environment

spacy validate: