Closed: mbrunecky closed this issue 3 years ago.
Sorry to hear you're having trouble with this.
I understand that you've been able to train a non-transformer pipeline on the same data, but can you confirm that this is actually a complete sample in the data?
Dated|I May|I 29,2018|I
(Maybe with an extra newline?) As the error indicates, this has I tags without a B tag before them, and is not a valid annotation.
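For comparison, a valid IOB sequence over those same tokens would need a B tag to open the entity, e.g. (showing only the ent_iob values, label omitted):
Dated|O May|B 29,2018|I
Starting the entity with I, as in the sample above, is exactly what E093 rejects.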
Each of my documents is 'complete', meaning that it has the full text and entity labels. I am annotating only two entities, NAME_FROM and NAME_TO, and those are 'names' (e.g. Wells Fargo Bank NA or John Brown), definitely not dates such as the one shown above. There are on average 2.96/2.84 entities per document, and the average document has about 3.2 k of text.
The 'document' is generated from our (Java) code producing the 'training format' (as JSONL), which is then converted into an annotated doc using offsets_to_biluo_tags(). With spaCy 3 I can batch any number of such documents into a DocBin (I encounter this problem both when batching 1 and 100 docs per DocBin). The log I posted is from a data 'subset' of 500 training / 50 dev (validation) documents, but I was getting the same problem on much larger data sets (up to about 8000 training / 2000 dev).
Your question made me try a different 'subset': instead of the first 500 documents, I took the last 200 documents (out of 5000). The result is the same failure, except that the error does not show the bad text:
⚠ Aborting and saving the final best model. Encountered exception:
ValueError('[E093] token.ent_iob values make invalid sequence: I without B\n')
Traceback (most recent call last):
...
ValueError: [E093] token.ent_iob values make invalid sequence: I without B
"Start 16:23:35.68 stop 16:25:57.12"
Since (regardless of the data sample) the error always happens at the same 'moment' (after reporting the epoch '0' results and then running 20 threads in parallel for about a minute), I do not believe it's the data markup. Besides, your offsets_to_biluo_tags() is not forgiving at all; I doubt it would generate a bad sequence. My data is from OCR, so it does contain various oddities: I had to deal with my entity end landing in text such as "Inc.,and" and ensure there are (spaCy-recognized) delimiters in the 'right' place.
train.zip valid.zip
I also tried:
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
with window changed to 256 (perhaps getting the spans messes things up), but it failed the same way... and I am not sure how the batching/splitting works.
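(For context, that span getter is configured in the transformer component block of config.cfg; a minimal sketch, where the stride value is my assumption based on the quickstart default:)
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96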
Posted files are for 200 training / 40 dev documents, but I reduced them to half (100 training / 20 dev) and got the same error at the same moment: after reporting 'epoch 0' and then, IMO, completing the next epoch of training and going to finish updates (or perhaps start validation). Perhaps the 'culprit' is my machine: it has 20 physical cores (40 logical with hyperthreading), and it has a nasty habit of exposing thread synchronization mistakes, because those threads really DO run in parallel :-).
I experience similar issues when I try to train a Hungarian NER model (not transformer). @polm shall I post here the details or open a separate issue?
My problem is with the transformer pipeline; my data goes through the non-transformer pipelines without any problem. My training/validation data is generated using spaCy conversion utilities, using offsets_to_biluo_tags(). If you are having a problem in a non-transformer pipeline, it is probably a different issue, and may very likely be caused by some subtle mistake in tag generation. I cannot imagine generating the original spaCy JSON doc data format with anything other than spaCy code, because tagging must align with spaCy tokenization, which is not trivial.
Hi, it turns out that this error is not related to transformers or CPU vs. GPU or multithreading, just the training data and the config settings. You can see the same error just with spacy debug data, no training involved.
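For example (a sketch; substitute your own config and .spacy file paths):
python -m spacy debug data config_gpu.cfg --paths.train train.spacy --paths.dev valid.spacy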
What's going on is a bug, but the underlying issue is that spacy doesn't really expect entity spans to cross sentence boundaries, and as a result some of the behavior here isn't very well tested. The ner model doesn't predict entities that cross sentence boundaries, either.
When max_length for spacy.Corpus.v1 is lower than the document length, the document gets split into individual sentences if sentence boundaries are present, which they are due to the dependency parses in this data. The training corpus contains a long text where there is a sentence boundary in the middle of an entity, and when it gets converted into sentences, the token.ent_iob value isn't converted correctly for the first token in the sentence and it ends up in an invalid state. The bug itself is in Span.as_doc().
The reason this looks like it might be related to transformers is that the default configs have different values for the corpus max_length depending on the transformer option.
@oroszgy: If you're seeing the exact same error code it's probably the same issue. If not, please open a new discussion thread with the details for your training setup and the errors you're seeing.
Thank you, Adriane. Over the weekend, I managed to run into the same problem in one of my other CPU-only NER projects. Now I am trying to verify that using a higher corpus max_length avoids it. I am not sure I understand the impact of 'splitting' the document (always one page), because the split may come close to the entity, affecting the entity context.
Sentence boundary determination in my OCR data is unreliable: the dot delimiter is frequently missed or sometimes added where it does not belong, and so is the word spacing. A sentence boundary should never fall within an entity; it is probably an artifact of incorrect sentence determination due to OCR. I will look into 'cleansing' my entity content to ensure that never happens.
That said, part of my problem is that the convert utility does not support 'from training data format', and I have to use this approach (abbreviated):
import spacy
from spacy.tokens import DocBin
from spacy.training import offsets_to_biluo_tags, biluo_tags_to_spans

nlp = spacy.load('en_core_web_lg')
doc = nlp(text)
# convert character-offset annotations into BILUO tags aligned with the tokenization
tags = offsets_to_biluo_tags(doc, annots['entities'])
doc.ents = biluo_tags_to_spans(doc, tags)
docbin = DocBin()
docbin.add(doc)
docbin.to_disk(db_file)
Until now I did not realize that using 'en_core_web_xx' (sm vs. lg) has a significant impact on the generated data. Perhaps I need to experiment with disabling pipeline components. All I want is a 'minimal' document (tokens and entity tags), not even POS tags or sentence boundaries.
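(Something along these lines, perhaps, though I have not verified which components are safe to drop:)
nlp = spacy.load('en_core_web_lg', exclude=['tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])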
Yes, having the documents split into inaccurate sentences is probably not helpful for your NER results. If you don't set sentence boundaries, try max_length = 0 so that it doesn't skip any training documents; otherwise it will completely skip training documents that are too long, which is probably not what you want. You may need to lower the training batch size if you run into memory issues. If you're still running out of memory, splitting the training documents into smaller docs (sections, chapters, paragraphs) could be helpful too, and would let the training loop shuffle instances a bit more while training.
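For example, in config.cfg (a sketch; only the max_length value is the point here):
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0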
If you are just creating training data, use a blank pipeline that just contains a tokenizer:
nlp = spacy.blank("en")
doc = nlp(text)
Or if you already have a pipeline loaded for some other purpose, you can use nlp.make_doc to only run the tokenizer:
doc = nlp.make_doc(text)
If it's helpful, there's an example conversion script for the NER TRAIN_DATA format (saved as JSON) here:
https://github.com/explosion/projects/blob/v3/pipelines/ner_demo/scripts/convert.py
You can see the data in assets/ in that project: https://github.com/explosion/projects/tree/v3/pipelines/ner_demo
Using doc.char_span has the advantage that you can use the alignment_mode option to snap misaligned entity spans to token boundaries if needed.
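A rough sketch (the text, offsets, and label below are made up for illustration):
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc = nlp("Grantor: Wells Fargo Bank NA, Trustee")
# the end offset 27 intentionally lands inside the token "NA";
# alignment_mode="expand" grows the span out to the full token boundary
span = doc.char_span(9, 27, label="NAME_FROM", alignment_mode="expand")
if span is not None:
    doc.ents = [span]
db = DocBin(docs=[doc])
db.to_disk("train.spacy")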
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
How to reproduce the behaviour
I am trying to demonstrate how much benefit my NER projects would gain IF I could train using a GPU and a transformer pipeline, instead of CPU only (using static vectors). My current GPU has only 6 GB, so I run out of memory very soon, and I am willing to run this 'comparison' on my 40-logical-core machine for days.
Using the 'Quickstart' configuration expanded into a full config.cfg, I keep failing in update() after the first epoch, regardless of the data set (size, content) or how many docs I batch in a DocBin. The data works fine in the non-transformer pipeline { tok2vec, ner }. The failure trace is always the same:
My config_gpu.cfg
Your Environment