explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

NER transformer training fails in CPU-only mode #8026

Closed mbrunecky closed 3 years ago

mbrunecky commented 3 years ago

How to reproduce the behaviour

I am trying to demonstrate how much benefit my NER projects would gain IF I could train using a GPU and a transformer pipeline, instead of CPU only (using static vectors). My current GPU has only 6 GB, so I run out of memory very soon, and I am willing to run this 'comparison' on my 40-logical-core machine for days.

Using the 'Quickstart' configuration expanded into a full config.cfg, I keep failing in update() after the first epoch, regardless of the data set (size, content) or how many docs I batch in a DocBin. The data works fine in a non-transformer pipeline { tok2vec, ner }. The failure trace is always the same:


["Start 16:13:17.03 in C:\Work\ML\Spacy3\dataset\ca_placer_dee_gpu"
A subdirectory or file C:\Work\ML\Spacy3\dataset\ca_placer_dee_gpu\model_gpu already exists.
python -m spacy train C:\Work\ML\Spacy3\dataset\ca_placer_dee_gpu/config_gpu.cfg  --output  C:\Work\ML\Spacy3\dataset\ca_placer_dee_gpu/model_gpu --paths.train C:\Work\ML\Spacy3\dataset\ca_placer_dee_gpu/train  --paths.dev C:\Work\ML\Spacy3\dataset\ca_placer_dee_gpu/valid
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
=========================== Initializing pipeline ===========================
[2021-05-06 16:13:19,921] [INFO] Set up nlp object from config
[2021-05-06 16:13:19,925] [INFO] Pipeline: ['transformer', 'ner']
[2021-05-06 16:13:19,925] [INFO] Created vocabulary
[2021-05-06 16:13:19,925] [INFO] Finished initializing nlp object
[2021-05-06 16:13:52,046] [INFO] Initialized pipeline components: ['transformer', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  -------------  --------  ------  ------  ------  ------
  0       0        5708.08    601.45    0.06    0.03    0.60    0.00
⚠ Aborting and saving the final best model. Encountered exception:
ValueError('[E093] token.ent_iob values make invalid sequence: I without
B\nDated|I May|I 29,2018|I \n\n|O')
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Work\ML\Spacy3\lib\site-packages\spacy\__main__.py", line 4, in <module>
    setup_cli()
  File "C:\Work\ML\Spacy3\lib\site-packages\spacy\cli\_util.py", line 69, in setup_cli
    command(prog_name=COMMAND)
  File "C:\Work\ML\Spacy3\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "C:\Work\ML\Spacy3\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "C:\Work\ML\Spacy3\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Work\ML\Spacy3\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Work\ML\Spacy3\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "C:\Work\ML\Spacy3\lib\site-packages\typer\main.py", line 497, in wrapper
    return callback(**use_params)  # type: ignore
  File "C:\Work\ML\Spacy3\lib\site-packages\spacy\cli\train.py", line 59, in train_cli
    train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr)
  File "C:\Work\ML\Spacy3\lib\site-packages\spacy\training\loop.py", line 115, in train
    raise e
  File "C:\Work\ML\Spacy3\lib\site-packages\spacy\training\loop.py", line 98, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "C:\Work\ML\Spacy3\lib\site-packages\spacy\training\loop.py", line 195, in train_while_improving
    nlp.update(
  File "C:\Work\ML\Spacy3\lib\site-packages\spacy\language.py", line 1112, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])
  File "spacy\pipeline\transition_parser.pyx", line 350, in spacy.pipeline.transition_parser.Parser.update
  File "spacy\pipeline\transition_parser.pyx", line 601, in spacy.pipeline.transition_parser.Parser._init_gold_batch
  File "spacy\pipeline\_parser_internals\ner.pyx", line 273, in spacy.pipeline._parser_internals.ner.BiluoPushDown.init_gold
  File "spacy\pipeline\_parser_internals\ner.pyx", line 53, in spacy.pipeline._parser_internals.ner.BiluoGold.__init__
  File "spacy\pipeline\_parser_internals\ner.pyx", line 69, in spacy.pipeline._parser_internals.ner.create_gold_state
  File "spacy\training\example.pyx", line 241, in spacy.training.example.Example.get_aligned_ner
  File "spacy\tokens\doc.pyx", line 698, in spacy.tokens.doc.Doc.ents.__get__
ValueError: [E093] token.ent_iob values make invalid sequence: I without B
Dated|I May|I 29,2018|I

|O
"Start 16:13:17.03 stop 16:15:50.21"

My config_gpu.cfg

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 96
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base"

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.tokenizer_config]
use_fast = true

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 500
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Your Environment

polm commented 3 years ago

Sorry to hear you're having trouble with this.

I understand that you've been able to train a non-transformer pipeline on the same data, but can you confirm that this is actually a complete sample in the data?

Dated|I May|I 29,2018|I

(Maybe with an extra newline?) As the error indicates, this has I tags without a B tag before them, and is not a valid annotation.

mbrunecky commented 3 years ago

Each of my documents is 'complete', meaning that it has the complete text and entity labels. I am annotating only two entities, NAME_FROM and NAME_TO, and those are 'names' (e.g. Wells Fargo Bank NA or John Brown), definitely not dates such as the one shown above. There are on average 2.96/2.84 entities per document, and the average document has about 3.2 k of text. The 'document' is generated by our (Java) code producing the 'training format' (as JSONL) and then converted into an annotated doc using offsets_to_biluo_tags(). With spaCy 3 I can batch any number of such documents into a DocBin (I encounter this problem whether I batch 1 or 100 docs per DocBin). The log I posted is from a data 'subset' of 500 training / 50 dev (validation) documents, but I was getting the same problem on much larger data sets (up to about 8000 training / 2000 dev).

Your question made me try a different 'subset': as opposed to the first 500 documents, I took the last 200 documents (out of 5000). The result is the same failure, except that the error does not show the bad text:

⚠ Aborting and saving the final best model. Encountered exception:
ValueError('[E093] token.ent_iob values make invalid sequence: I without B\n')
Traceback (most recent call last):
...
ValueError: [E093] token.ent_iob values make invalid sequence: I without B

"Start 16:23:35.68 stop 16:25:57.12"

Since (regardless of the data sample) the error always happens at the same 'moment' (after reporting the epoch '0' results and then running 20 threads in parallel for about a minute), I do not believe it is the data markup. Besides, your offsets_to_biluo_tags() is not forgiving at all - I doubt it would generate a bad sequence. My data is from OCR, so it does contain various oddities - I had to deal with my entity end landing in text such as "Inc.,and" and ensure there are (spaCy-recognized) delimiters in the 'right' place.

Attachments: train.zip, valid.zip

I also tried:

@span_getters = "spacy-transformers.strided_spans.v1"
window = 128

with window changed to 256 (perhaps getting the spans messes things up), but it failed the same way. I am also not sure how the batching/splitting works.

mbrunecky commented 3 years ago

The posted files are for 200 training / 40 dev documents, but I reduced them to half (100 training, 20 dev) and got the same error, at the same moment: after reporting 'epoch 0' and then, in my opinion, completing the next epoch's training and going on to finish updates (or perhaps start validation). Perhaps the 'culprit' is my machine; it has 20 physical cores (40 logical with hyperthreading), and it has a nasty habit of exposing thread synchronization mistakes, because those threads really DO run in parallel :-).

oroszgy commented 3 years ago

I am experiencing similar issues when I try to train a Hungarian NER model (not a transformer). @polm shall I post the details here or open a separate issue?

mbrunecky commented 3 years ago

My problem is with the transformer pipeline; my data goes through the non-transformer pipelines without any problem. My training/validation data is generated using spaCy conversion utilities, specifically offsets_to_biluo_tags(). If you are having a problem in a non-transformer pipeline, it is probably a different issue, and it may very likely be caused by some subtle mistake in tag generation. I cannot imagine generating the original spaCy JSON doc data format with anything other than spaCy code, because the tagging must align with spaCy tokenization, which is not trivial.

adrianeboyd commented 3 years ago

Hi, it turns out that this error is not related to transformers or CPU vs. GPU or multithreading, just the training data and the config settings. You can see the same error just with spacy debug data, no training involved.
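Using the paths from the original command above, that check looks something like:

python -m spacy debug data C:\Work\ML\Spacy3\dataset\ca_placer_dee_gpu/config_gpu.cfg --paths.train C:\Work\ML\Spacy3\dataset\ca_placer_dee_gpu/train --paths.dev C:\Work\ML\Spacy3\dataset\ca_placer_dee_gpu/valid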

What's going on is a bug, but the underlying issue is that spacy doesn't really expect entity spans to cross sentence boundaries, and as a result some of the behavior here isn't very well tested. The ner model doesn't predict entities that cross sentence boundaries, either.

When max_length for spacy.Corpus.v1 is lower than the document length, the document gets split into individual sentences if sentence boundaries are present, which they are due to the dependency parses in this data. The training corpus contains a long text where there is a sentence boundary in the middle of an entity and when it gets converted into sentences, the token.ent_iob value isn't converted correctly for the first token in the sentence and it ends up in an invalid state. The bug itself is in Span.as_doc().

The reason this looks like it might be related to transformers is that the default configs have different values for the corpus max_length depending on the transformer option.
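One way to see this in the converted data is to scan the DocBin for gold entities whose first and last tokens fall in different sentences; a minimal sketch (the file name is just a placeholder for your own .spacy file):

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("train/part1.spacy")  # placeholder path
for doc in doc_bin.get_docs(nlp.vocab):
    if not doc.has_annotation("SENT_START"):
        continue  # no sentence boundaries stored, nothing to check
    for ent in doc.ents:
        # Token.sent is the sentence span containing that token; if the first
        # and last entity tokens sit in different sentences, the entity crosses
        # a boundary and will be mangled when the corpus splits by sentence.
        if ent[0].sent.start != ent[-1].sent.start:
            print(f"Entity crosses sentence boundary: {ent.text!r} ({ent.label_})")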

@oroszgy: If you're seeing the exact same error code it's probably the same issue. If not, please open a new discussion thread with the details for your training setup and the errors you're seeing.

mbrunecky commented 3 years ago

Thank you, Adriane. Over the weekend, I managed to run into the same problem in one of my other CPU-only NER projects. Now I am trying to verify that using a higher corpus max_length avoids it. I am not sure I understand the impact of 'splitting' the document (always one page), because the split may come close to an entity, affecting the entity context. Sentence boundary determination in my OCR data is unreliable: the dot delimiter is frequently missed or sometimes added where it does not belong, and so is the word spacing. A sentence boundary should never fall within an entity, and when it does it is probably an artifact of incorrect sentence determination due to OCR. I will look into 'cleansing' my entity content to ensure that never happens.

That said, part of my problem is that the convert utility does not support 'from training data format', so I have to use this approach (abbreviated):
    import spacy
    from spacy.tokens import DocBin
    from spacy.training import offsets_to_biluo_tags, biluo_tags_to_spans

    nlp = spacy.load('en_core_web_lg')
    doc = nlp(text)
    # convert character-offset annotations to BILUO tags aligned with the tokenization
    tags = offsets_to_biluo_tags(doc, annots['entities'])
    doc.ents = biluo_tags_to_spans(doc, tags)
    docbin = DocBin()
    docbin.add(doc)
    docbin.to_disk(db_file)

Until now I did not realize that the choice of 'en_core_web_xx' (sm vs. lg) has a significant impact on the generated data. Perhaps I need to experiment with disabling pipeline components. All I want is a 'minimal' document (tokens and entity tags), without even POS tags or sentence boundaries.

adrianeboyd commented 3 years ago

Yes, having the documents split into inaccurate sentences is probably not helpful for your NER results. If you don't set sentence boundaries, try max_length = 0 so that it doesn't skip any training documents; otherwise it will completely skip training documents that are too long, which is probably not what you want. You may need to lower the training batch size if you run into memory issues. If you're still running out of memory, splitting the training documents into smaller docs (sections, chapters, paragraphs) could be helpful too, and would let the training loop shuffle instances a bit more during training.
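In the config you posted, that is just a change to the training corpus reader, e.g.:

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null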

If you are just creating training data, use a blank pipeline that just contains a tokenizer:

import spacy

nlp = spacy.blank("en")
doc = nlp(text)

Or if you already have a pipeline loaded for some other purpose, you can use nlp.make_doc to only run the tokenizer:

doc = nlp.make_doc(text)

If it's helpful, there's an example conversion script for the NER TRAIN_DATA format (saved as JSON) here:

https://github.com/explosion/projects/blob/v3/pipelines/ner_demo/scripts/convert.py

You can see the data in assets/ in that project: https://github.com/explosion/projects/tree/v3/pipelines/ner_demo

Using doc.char_span has the advantage that you can use the alignment_mode option to snap misaligned entity spans to token boundaries if needed.
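For example, a sketch of the conversion step above using char_span instead of BILUO tags (assuming annots["entities"] is a list of (start, end, label) offsets as in your snippet; "expand" snaps misaligned offsets outward to token boundaries):

spans = []
for start, end, label in annots["entities"]:
    # char_span returns None if the offsets can't be mapped to tokens under the chosen mode
    span = doc.char_span(start, end, label=label, alignment_mode="expand")
    if span is not None:
        spans.append(span)
doc.ents = spans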

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.