explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

NER Missing values with configs #8103

Closed · thejamesmarq closed this issue 3 years ago

thejamesmarq commented 3 years ago

I am trying to train a textcat and an ner model with a shared tok2vec component, using configs. My training data comes from two different processes, one automated and one manual, so some of my examples have cats and some have entities. When training, I get the error TypeError("'NoneType' object is not iterable"), and my stack trace ends with this:

  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/thinc/layers/chain.py", line 60, in backprop
    dX = callback(dY)
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/thinc/layers/concatenate.py", line 68, in backprop
    dX += bwd(dY)
TypeError: 'NoneType' object is not iterable

I think this is related to support for missing annotations in the NER data. An old issue (https://github.com/explosion/spaCy/issues/2603) brought this up, and the solution was to use IOB formatting rather than spans, since IOB can indicate missing values with None. I am trying to get this to work with configs: my training data is stored as DocBins, which do not allow the IOB format for ents, and I am using the spacy.Corpus.v1 reader (https://github.com/explosion/spaCy/blob/master/spacy/training/corpus.py), which yields examples directly from a DocBin, so they cannot be in IOB format.

I'm wondering if it would be reasonable to add an option to handle missing values in the reader, which could be specified in the config.
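
For illustration, a custom reader along these lines (entirely hypothetical, not part of spaCy) could wrap the stock Corpus reader and mark docs that carry no entity annotation as "missing" rather than "contains no entities":

import spacy
from spacy.training import Corpus

# Hypothetical reader: the registered name and the assumption that
# "no ents" means "not annotated" are both made up for this sketch.
@spacy.registry.readers("partial_ner_corpus.v1")
def create_partial_ner_corpus(path):
    corpus = Corpus(path)

    def read(nlp):
        for example in corpus(nlp):
            if not example.reference.ents:
                # Set every token to the missing IOB value instead of "O".
                example.reference.set_ents([], default="missing")
            yield example

    return read

The obvious caveat is that this also treats genuinely entity-free docs as unannotated, so it only fits data where an empty doc.ents really means "not labelled".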


adrianeboyd commented 3 years ago

I don't think this error is coming from the NER component, but you can set missing NER annotations either with "-" as the IOB annotation in Example.from_dict / Doc(ents=), or by using doc.set_ents(). This would set all tokens to have missing NER annotation in an existing doc:

doc.set_ents([], default="missing")
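
For the Example.from_dict route, a minimal sketch (the text and the CONDITION label are made up; "-" marks tokens whose annotation is missing, as opposed to "O" for tokens known not to be part of an entity):

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp("Type 2 diabetes confirmed")
# BILUO tags: the first three tokens form a (made-up) CONDITION entity;
# the last token's annotation is missing ("-"), not negative ("O").
example = Example.from_dict(doc, {"entities": ["B-CONDITION", "I-CONDITION", "L-CONDITION", "-"]})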

In terms of the error, it would be helpful to see the full config, some sample docs/annotation, and the full error traceback, since otherwise it's pretty hard to figure out what's going on. You could also try training with just NER or just textcat to see if that helps narrow things down.

thejamesmarq commented 3 years ago

@adrianeboyd Here's my config

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["tok2vec", "textcat", "ner"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.textcat]
factory = "textcat_multilabel"
name = "textcat"
threshold = 0.5

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 2
no_output_layer = false
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[components.ner]
factory = "ner"
name = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = true
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = true
max_length = 0
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1000
max_epochs = 1
max_steps = 10
eval_frequency = 1
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = 1.0
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[initialize.components]

[initialize.tokenizer]

thejamesmarq commented 3 years ago

Here's the full traceback:

Traceback (most recent call last):
  File "/Users/james/.pyenv/versions/3.8.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/james/.pyenv/versions/3.8.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/__main__.py", line 4, in <module>
    setup_cli()
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/cli/_util.py", line 69, in setup_cli
    command(prog_name=COMMAND)
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/cli/train.py", line 59, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/training/loop.py", line 115, in train
    raise e
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/training/loop.py", line 98, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/training/loop.py", line 195, in train_while_improving
    nlp.update(
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/language.py", line 1112, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/spacy/pipeline/textcat.py", line 205, in update
    bp_scores(d_scores)
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/thinc/layers/chain.py", line 60, in backprop
    dX = callback(dY)
  File "/Users/james/code/question-bot-spacy/qbot_spacy_venv/lib/python3.8/site-packages/thinc/layers/concatenate.py", line 68, in backprop
    dX += bwd(dY)
TypeError: 'NoneType' object is not iterable

thejamesmarq commented 3 years ago

An example of a Doc that has classification labels but no span labels (the values of doc.ents and doc.cats):

doc.ents -> ()
doc.cats -> {'is_urgent_case': True}

An example of a Doc that has span labels but no classification labels (the values of doc.ents and doc.cats):

doc.ents -> (Type 2 diabetes,)
doc.cats -> {}
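
For concreteness, a rough construction of two such docs (the texts and the CONDITION label are made up):

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Doc with cats but no ents: doc.ents stays empty, doc.cats is set.
doc1 = nlp("Please respond as soon as possible.")
doc1.cats = {"is_urgent_case": True}

# Doc with ents but no cats: doc.ents is set, doc.cats stays {}.
doc2 = nlp("Patient has Type 2 diabetes.")
doc2.set_ents([Span(doc2, 2, 5, label="CONDITION")])

Note that doc2.cats == {} cannot be distinguished from "none of the labels apply", which turns out to matter below.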
thejamesmarq commented 3 years ago

I should mention that this same error (with the same stack trace) happens when I use a separate config for just the NER component, using the same data. When I train the NER component excluding the examples that have cats but no ents, everything works fine.

adrianeboyd commented 3 years ago

Ah, one thing I didn't think of is that unfortunately there is no way to indicate "missing" doc.cats, so it's probably not going to be possible to train (with good results) from mixed data where some docs don't have cats.

I think at this point I would recommend not using a shared tok2vec component and instead training the two models separately. It's easiest then if the tok2vec is not a listener but is internal to the component, as in this config (similar to en_core_web_sm):

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,2500,2500,2500]
include_static_vectors = false
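
(A complete internal tok2vec would also need an encode sublayer; mirroring the MaxoutWindowEncoder settings from the shared tok2vec earlier in the thread, it would presumably look something like this:)

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3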

For the NER model, I think you'll get better results with the default MultiHashEmbed features which should include more than ORTH and SHAPE.

If you keep running into errors, could you attach a .spacy file with a doc that causes this error (anonymized as necessary, of course)? Unless you're setting ENT_IOB from Cython, we've made it pretty hard to set invalid NER annotation, so I'm having trouble seeing how this could come from the NER annotation, but maybe there's something else going on.

thejamesmarq commented 3 years ago

Ahh, that makes sense. I'll try that out.

Separately, do you think I might be able to overcome this by having two separate configs, one for NER, one for classification, and then source the tok2vec and learned component into the other? I'm thinking of this flow:

adrianeboyd commented 3 years ago

If you freeze ner but not tok2vec, then training further will cause the tok2vec to be modified to work better for textcat, and the ner performance will be (very) degraded.

I would:

If you want, you can use source + frozen_components to go ahead and include ner in the second config, or you can collate them later in another way. You could also use a third config with spacy assemble that sources both to create the final pipeline. There are a lot of options for how to combine them, and I think it's simplest to train them separately. As an example, for the pretrained pipelines we have components trained with 2-3 separate configs that are merged together with a short collate script that uses nlp.add_pipe(source=).
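
As a rough illustration (all paths here are hypothetical placeholders), the textcat config could source and freeze the already-trained ner component like this:

[components.ner]
source = "training/ner/model-best"

[training]
frozen_components = ["ner"]

And a minimal collate script in the style described, merging two separately trained pipelines with nlp.add_pipe(source=):

import spacy

# Load the two separately trained pipelines (placeholder paths).
textcat_nlp = spacy.load("training/textcat/model-best")
ner_nlp = spacy.load("training/ner/model-best")

# Copy the trained ner component (with its internal tok2vec) into the
# textcat pipeline, then save the combined pipeline to disk.
textcat_nlp.add_pipe("ner", source=ner_nlp)
textcat_nlp.to_disk("training/combined")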

I don't think that having a shared tok2vec is going to be that helpful overall to the performance even if you had combined data. The shared tok2vec makes sense for components that make similar kinds of predictions (tagger + parser, for instance), but much less sense for ner + textcat.

thejamesmarq commented 3 years ago

Thanks for all that input.

I was able to get the solution I described earlier working, with pretty much the metrics issues you were describing. I think it does make sense to split them off.

github-actions[bot] commented 3 years ago

This issue has been automatically closed because it was answered and there was no follow-up discussion.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.