Closed thejamesmarq closed 3 years ago
I don't think this error is coming from the NER component, but you can set missing NER annotations either with "-" as the IOB annotation in Example.from_dict/Doc(ents=) or by using doc.set_ents(). This would set all tokens to have missing NER annotation in an existing doc:
doc.set_ents([], default="missing")
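To make the two options a bit more concrete, here is a minimal sketch using a blank English pipeline. The sample text is made up for illustration, and only the Example.from_dict and doc.set_ents() routes are shown:

```python
import spacy
from spacy.training import Example

# Blank pipeline: tokenizer only, no trained components needed.
nlp = spacy.blank("en")

# Option 1: mark every token of an existing doc as having missing NER annotation.
doc = nlp.make_doc("Patient reports Type 2 diabetes")
doc.set_ents([], default="missing")
missing_via_set_ents = [token.ent_iob_ for token in doc]

# Option 2: pass "-" for each token as the NER tag when building an Example.
predicted = nlp.make_doc("Patient reports Type 2 diabetes")
example = Example.from_dict(predicted, {"entities": ["-"] * len(predicted)})
missing_via_dict = [token.ent_iob_ for token in example.reference]
```

In both cases token.ent_iob_ should be the empty string, which is how spaCy represents "no NER annotation" as opposed to "O" (outside an entity).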
In terms of the error, it would be helpful to see the full config, some sample docs/annotation, and the full error traceback, since otherwise it's pretty hard to figure out what's going on. You could also try training with just NER or just textcat to see if that helps narrow things down.
@adrianeboyd Here's my config
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["tok2vec", "textcat", "ner"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = true
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
[components.textcat]
factory = "textcat_multilabel"
name = "textcat"
threshold = 0.5
[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null
[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 2
no_output_layer = false
nO = null
[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
[components.ner]
factory = "ner"
name = "ner"
moves = null
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = true
max_length = 0
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = true
max_length = 0
limit = 0
augmenter = null
[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1000
max_epochs = 1
max_steps = 10
eval_frequency = 1
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = 1.0
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
[initialize.components]
[initialize.tokenizer]
Traceback (most recent call last):
File "/Users/james/.pyenv/versions/3.8.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/james/.pyenv/versions/3.8.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/__main__.py", line 4, in <module>
setup_cli()
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/cli/_util.py", line 69, in setup_cli
command(prog_name=COMMAND)
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/cli/train.py", line 59, in train_cli
train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/training/loop.py", line 115, in train
raise e
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/training/loop.py", line 98, in train
for batch, info, is_best_checkpoint in training_step_iterator:
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/training/loop.py", line 195, in train_while_improving
nlp.update(
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/language.py", line 1112, in update
proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/pipeline/textcat.py", line 205, in update
bp_scores(d_scores)
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/thinc/layers/chain.py", line 60, in backprop
dX = callback(dY)
File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/thinc/layers/concatenate.py", line 68, in backprop
dX += bwd(dY)
TypeError: 'NoneType' object is not iterable
An example of a Doc that has classification labels but not span labels (values of Doc.ents and Doc.cats):
doc.ents -> ()
doc.cats -> {'is_urgent_case': True}
An example of a Doc that has span labels but not classification labels (values of Doc.ents [spans] and Doc.cats):
doc.ents -> (Type 2 diabetes,)
doc.cats -> {}
I should mention, this same error (with stack trace) happens when I use a separate config for just the NER component, using the same data. When I train the NER component excluding the examples that have cats but no ents, everything works fine.
Ah, one thing I didn't think of is that unfortunately there is no way to indicate "missing" doc.cats, so it's probably not going to be possible to train (with good results) from mixed data where some docs don't have cats.
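Since missing cats can't be expressed, one way to prepare the data is to split the mixed docs into one DocBin per annotation type. This is only a sketch: the helper name split_by_annotation, the CONDITION label, and the sample texts are made up, and it assumes that a doc with no entities was never annotated for NER (real negative examples would be dropped too).

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# Toy docs shaped like the two examples in this thread (texts are invented).
cats_only = nlp.make_doc("Please call me back today")
cats_only.cats = {"is_urgent_case": 1.0}

ents_only = nlp.make_doc("Diagnosed with Type 2 diabetes")
# Tokens 2..4 are "Type", "2", "diabetes"; the label is hypothetical.
ents_only.ents = [Span(ents_only, 2, 5, label="CONDITION")]


def split_by_annotation(docs):
    """Return (textcat_bin, ner_bin), filled according to the annotation each doc has."""
    textcat_bin, ner_bin = DocBin(), DocBin()
    for doc in docs:
        if doc.cats:
            textcat_bin.add(doc)
        if doc.ents:
            ner_bin.add(doc)
    return textcat_bin, ner_bin


textcat_bin, ner_bin = split_by_annotation([cats_only, ents_only])
```

Each DocBin could then be written with to_disk() and used as the training corpus for its own config.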
I think at this point I would recommend not trying to use a shared tok2vec component and training the two models separately. It's easiest then if the tok2vec is not a listener, but internal to the component, as in this config (similar to en_core_web_sm):
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"
[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false
For the NER model, I think you'll get better results with the default MultiHashEmbed features, which should include more than ORTH and SHAPE.
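If it's unclear which components in a config still depend on a shared tok2vec, a quick check along these lines might help. It is a sketch only: the config text is a stripped-down stand-in rather than a trainable config, and uses_listener is a made-up helper name.

```python
from thinc.api import Config

# Minimal stand-in config: ner has an internal tok2vec, textcat uses a listener.
CONFIG_TEXT = """
[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.textcat]
factory = "textcat_multilabel"

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
"""

# Parsing only; no registered functions are resolved here.
config = Config().from_str(CONFIG_TEXT)


def uses_listener(component_cfg):
    """True if the component's tok2vec block is a Tok2VecListener."""
    tok2vec = component_cfg.get("model", {}).get("tok2vec", {})
    return tok2vec.get("@architectures", "").startswith("spacy.Tok2VecListener")


listeners = {name: uses_listener(cfg) for name, cfg in config["components"].items()}
```

A component that shows up as True here still needs a tok2vec component upstream in the pipeline, so it can't be trained or moved around on its own.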
If you keep running into errors, could you attach a .spacy file with a doc that causes this error (anonymized as necessary, of course)? Unless you're setting ENT_IOB from cython, we've made it pretty hard to set invalid NER annotation, so I'm having trouble figuring out a way this is from the NER annotation, but maybe there's something else going on.
Ahh, that makes sense. I'll try that out.
Separately, do you think I might be able to overcome this by having two separate configs, one for NER, one for classification, and then source the tok2vec and learned component into the other? I'm thinking of this flow:
1. Train the tok2vec and ner components, using data that only contains annotations for NER
2. Source the tok2vec and ner components from the first into the classification config, with ner as a frozen + disabled component
If you freeze ner but not tok2vec, then training further will modify the tok2vec to work better for textcat, and the ner performance will be (very) degraded.
I would:
- train ner from one config with an internal tok2vec
- train textcat from another config with an internal tok2vec
If you want, you can use source + frozen_components to go ahead and include ner in the second config, or you can collate them later in another way. You could also use a third config with spacy assemble that sources both to create the final pipeline. There are a lot of options for how to combine them and I think it's simplest to train them separately. As an example, for the pretrained pipelines, we have components trained with 2-3 separate configs that are merged together with a short collate script that uses nlp.add_pipe(source=).
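A collate script of that kind could look roughly like the sketch below. This is not the actual script used for the pretrained pipelines. The two source pipelines here are untrained stand-ins so the snippet runs on its own; in practice they would be loaded with spacy.load() from the output directories of the two training runs.

```python
import spacy

# Stand-in for the pipeline trained from the NER config.
nlp_ner = spacy.blank("en")
nlp_ner.add_pipe("ner")

# Stand-in for the pipeline trained from the textcat config.
nlp_textcat = spacy.blank("en")
nlp_textcat.add_pipe("textcat_multilabel", name="textcat")

# Collate: start from an empty pipeline and source each component by name.
final_nlp = spacy.blank("en")
final_nlp.add_pipe("ner", source=nlp_ner)
final_nlp.add_pipe("textcat", source=nlp_textcat)

pipe_names = final_nlp.pipe_names
```

With trained sources, final_nlp.to_disk() would then save the combined pipeline. Because each component has its own internal tok2vec, neither depends on the other being present.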
I don't think that having a shared tok2vec is going to be that helpful overall to the performance even if you had combined data. The shared tok2vec makes sense for components that make similar kinds of predictions (tagger + parser, for instance), but much less sense for ner + textcat.
Thanks for all that input.
I was able to get the solution I described earlier working...with pretty much the metrics issues you were describing. I think it does make sense to split them off.
I am trying to train a textcat and ner model that have a shared tok2vec component using configs. My training data comes from two different processes, one automated and one manual, so some of my examples have cats and some have entities. When training, I get the error TypeError("'NoneType' object is not iterable"), and my stack trace ends in the backprop callback in thinc/layers/concatenate.py.
I think this is related to support for missing annotations in the NER data. An old issue (https://github.com/explosion/spaCy/issues/2603) brought this up, and the solution was to use IOB formatting as opposed to spans, since IOB can indicate missing values with None. I am trying to get this to work with configs, and have my training data stored as DocBins, which do not allow the IOB format for ents, and am using the spacy.Corpus.v1 reader (https://github.com/explosion/spaCy/blob/master/spacy/training/corpus.py). This reader yields examples directly from a DocBin, so they cannot be in IOB format.
I'm wondering if it would be reasonable to add an option to handle missing values in the reader, which could be specified in the config.