Closed erotavlas closed 4 years ago
This means that the token boundaries aren't aligned with the character spans in the annotation. When spaCy runs into these cases, it basically ignores the annotation because it doesn't know which of its tokens the annotation should apply to. It's different from `O` because `O` means "no entity", while `-` allows it to skip these cases. See a related comment: https://github.com/explosion/spaCy/issues/5112#issuecomment-595637564
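To make that behavior concrete, here is a pure-Python sketch of roughly what an offsets-to-BILUO conversion does. This is only an illustration of the alignment logic, not spaCy's actual implementation; the function name and the token-offset representation are made up for the example:

```python
def biluo_from_offsets(tokens, entities):
    """Assign BILUO tags to tokens given character-offset entity spans.

    tokens:   list of (start, end) character offsets, one per token.
    entities: list of (start, end, label) character spans.
    Tokens overlapping an entity whose boundaries don't line up with
    token boundaries get "-" (unknown), which is distinct from "O".
    """
    tags = ["O"] * len(tokens)
    starts = {s: i for i, (s, e) in enumerate(tokens)}
    ends = {e: i for i, (s, e) in enumerate(tokens)}
    for ent_start, ent_end, label in entities:
        first = starts.get(ent_start)
        last = ends.get(ent_end)
        if first is None or last is None:
            # Misaligned span: mark every overlapping token as "-"
            for i, (s, e) in enumerate(tokens):
                if s < ent_end and e > ent_start:
                    tags[i] = "-"
        elif first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label
    return tags
```

With the example from this thread, the DATE span (4, 13) starts inside the token "jvc/3/21/2008" (character span 0-13), so no token boundary matches the span start and the overlapping tokens come out as "-" rather than getting B/I/L/U tags.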
Since it's not obvious to users when this happens, I've thought about adding more explicit warnings (see #5007), but it can get very noisy, especially if you're using the simple training scripts.
@adrianeboyd Does spacy ignore the entire sentence when training the model?
Or is this only an issue with the converter?
Just the NER annotation on those tokens is ignored. The NER model doesn't actually know anything about sentence boundaries, just document and token boundaries. The `biluo_tags_from_offsets` converter is used internally when you provide annotation in this format, as in the example training scripts.
@adrianeboyd
I've recently switched to the CLI for training, so I'm converting the spaCy format to JSON.
Does this issue occur when converting the spaCy format to JSON for the CLI train command?
Should I output my own BILUO tags, and then convert those to JSON?
I'm wondering what the best solution to this is because when starting with a smaller training set, this could potentially eliminate quite a number of training examples from the data set.
Also, it isn't always apparent what the tokenization rules are doing, so an annotator may not realize that an annotation's character positions don't land on token boundaries.
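One way to surface this for annotators is to check each span's boundaries against token boundaries before training. The helper below is hypothetical and uses a naive whitespace tokenizer purely for illustration; with spaCy you would collect `(token.idx, token.idx + len(token))` from the real tokenizer instead:

```python
import re

def find_misaligned(text, entities):
    """Return entity spans whose start or end falls inside a token.

    Uses a naive whitespace tokenizer as a stand-in for a real one,
    so this only illustrates the boundary check itself.
    """
    boundaries = set()
    for match in re.finditer(r"\S+", text):
        boundaries.add(match.start())
        boundaries.add(match.end())
    return [
        (start, end, label, text[start:end])
        for start, end, label in entities
        if start not in boundaries or end not in boundaries
    ]
```

Running a check like this over the training set before conversion shows exactly which annotations would otherwise be silently dropped.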
Yes, that's why I wanted to add a warning, but I hadn't found an easy / non-overwhelming way to incorporate it yet. Ines has simplified some of the warnings setup for v3, so it may be easier to incorporate it there than for v2.
In spaCy's JSON training format, if you provide `"raw"` text, you can still have misaligned tokens where the annotation is discarded because there's no way to map it to spaCy's tokenization. I think for NER annotation it only matters whether the start and end of the span are correct, since the tokenization in the middle doesn't affect the final span. For things like fine-grained POS tags, which are always tied directly to a token, the annotation for all the misaligned tokens is ignored.
If you don't provide a `"raw"` text, then it trains from the gold tokenization and no annotation is discarded, but you get a better picture of the model's performance on real texts by including `"raw"`, since you see how the actual tokenizer performance affects the model performance. The tokenization accuracy is included in the train CLI output and the model's `meta.json`.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
spacy version 2.2.3
I have a gold-standard annotated data set in spaCy format, like this:
("""jvc/3/21/2008 Dr. John V. Smithn.""",{'entities':[(4,13,'DATE'),(18,32,'PERSON')]})
And I'm trying to convert it to IOB format. But for that particular example, the function `biluo_tags_from_offsets()` produces `-` tags for these tokens.
I have many examples of this; in fact, in my test set I found 67 of them out of 45,783 tokens across various annotations. I could not find a pattern or a cause; the text and annotations appear correct.
Unless it's a tokenization issue, because the first annotation (the DATE) starts inside what spaCy thinks is a single token.
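That hypothesis is easy to check by finding the token whose character span contains the entity's start offset. Whitespace tokenization is used below as a rough stand-in, but per the observation above, spaCy also treats "jvc/3/21/2008" as a single token:

```python
import re

def token_containing(text, offset):
    """Return ((start, end), token_text) for the whitespace token
    whose character span contains `offset`, or None if none does."""
    for match in re.finditer(r"\S+", text):
        if match.start() <= offset < match.end():
            return (match.start(), match.end()), match.group()
    return None

text = "jvc/3/21/2008 Dr. John V. Smithn."
# The DATE annotation starts at offset 4, which falls inside the
# first token (character span 0-13), so no token boundary
# coincides with the span start and alignment fails.
span, token = token_containing(text, 4)
```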
However, I find examples like this which appear to be tokenized correctly. For instance, this annotation:
("""Patient Name: DUCK, DONALD A. Accession #: S11-1234 """,{'entities':[(14,28,'PERSON'),(47,55,'SPECIMENID')]})
yields correctly aligned tags.
My method looks like this