explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

biluo_tags_from_offsets() is generating empty '-' tags for some entities #5272

Closed erotavlas closed 4 years ago

erotavlas commented 4 years ago

spacy version 2.2.3

I have a gold-standard annotated data set in spaCy format, like this:

("""jvc/3/21/2008 Dr. John V. Smithn.""",{'entities':[(4,13,'DATE'),(18,32,'PERSON')]})

I'm trying to convert it to IOB format, but for that particular example the function biluo_tags_from_offsets() produces this:

-
O
B-PERSON
I-PERSON
I-PERSON

For these tokens

jvc/3/21/2008
Dr.
John
V.
Smithn

I have many examples of this. In fact, for my test set I found 67 of them out of 45,783 tokens across various annotations. I could not find a pattern or a cause; the text and annotations appear correct.

Unless it's a tokenization issue, because the first annotation, for the date, starts inside what spacy thinks is a single token.
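
To check that, I printed each token's character span next to the annotation offsets with something like this (rough sketch; I'm using a blank English pipeline here as a stand-in for my actual model):

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")  # stand-in for my model's tokenizer
text = "jvc/3/21/2008 Dr. John V. Smithn."
entities = [(4, 13, 'DATE'), (18, 32, 'PERSON')]

doc = nlp.make_doc(text)
for token in doc:
    # token.idx is the character offset where this token starts
    print(token.idx, token.idx + len(token), repr(token.text))
print(biluo_tags_from_offsets(doc, entities))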

However, I find examples like the following, which appear to be tokenized correctly:

Patient
Name
:
DUCK
,
DONALD
A.

Accession
#
:
S11
-
1234

yields

O
O
O
-
-
-
-
O
O
O
O
B-SPECIMENID
I-SPECIMENID
I-SPECIMENID
O

from this annotation

("""Patient Name: DUCK, DONALD A. Accession #: S11-1234 """,{'entities':[(14,28,'PERSON'),(47,55,'SPECIMENID')]})

My method looks like this:

from spacy.gold import GoldParse, biluo_tags_from_offsets

def get_gold(ner_model, examples):
    tag = []
    text = []
    for input_, annot in examples:
        # tokenize with the model's tokenizer, then align the character
        # offsets in the annotation against those tokens
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot['entities'])
        tags = biluo_tags_from_offsets(doc_gold_text, annot['entities'])
        tag.extend(tags)
        for token in gold.words:
            text.append(token)

    # convert BILUO to IOB, counting the misaligned '-' tags along the way
    count = 0
    for i, item in enumerate(tag):
        if item == '-':
            count += 1
            print(count)
        if item.startswith('L-'):
            tag[i] = tag[i].replace("L-", "I-")
        elif item.startswith('U-'):
            tag[i] = tag[i].replace("U-", "B-")

    return text, tag
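
I call it roughly like this (sketch; spacy.blank is just a stand-in for loading my actual model):

import spacy

nlp = spacy.blank("en")  # stand-in; I load my own trained model here
examples = [
    ("""jvc/3/21/2008 Dr. John V. Smithn.""",
     {'entities': [(4, 13, 'DATE'), (18, 32, 'PERSON')]}),
]
text, tags = get_gold(nlp, examples)
for token, tag in zip(text, tags):
    print(token, tag)
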
adrianeboyd commented 4 years ago

This means that the token boundaries aren't aligned with the character spans in the annotation. When spacy runs into these cases, it basically ignores the annotation because it doesn't know which of its tokens the annotation should apply to. It's different from O because O means no entity, so the - allows it to skip these cases. See a related comment: https://github.com/explosion/spaCy/issues/5112#issuecomment-595637564
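
To make the distinction concrete, here's a toy example (made up, not from your data) where the span deliberately ends in the middle of "York": the tokens the span touches come out as -, while "City", which no annotation touches, comes out as O.

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp.make_doc("New York City")
# (0, 6) ends inside the token "York", so the span can't be aligned
print(biluo_tags_from_offsets(doc, [(0, 6, "GPE")]))
# expected: ['-', '-', 'O']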

Since it's not obvious to users when this happens, I've thought about adding more explicit warnings (see #5007), but it can get very noisy, especially if you're using the simple training scripts.

erotavlas commented 4 years ago

@adrianeboyd Does spacy ignore the entire sentence when training the model?

Or is this only an issue with the converter?

adrianeboyd commented 4 years ago

Just the NER annotations on those tokens are ignored. The NER model doesn't actually know anything about sentence boundaries, only document and token boundaries. The biluo_tags_from_offsets converter is used internally when you provide annotation in this format, as in the example training scripts.

erotavlas commented 4 years ago

@adrianeboyd
I've recently switched to the CLI for training, so I'm converting that spacy format to JSON. Does this issue also occur when converting the spacy format to JSON for the CLI train command? Should I output my own BILUO tags and then convert those to JSON?

I'm wondering what the best solution is, because when starting with a smaller training set this could eliminate quite a number of training examples from the data set.

Also, it isn't always apparent what the tokenization rules are doing, so an annotator may not be aware that an annotation doesn't fall on the correct character positions.
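
In the meantime I'm thinking of a pre-check along these lines (rough sketch, the function name is mine) to flag examples whose offsets don't line up before converting anything:

from spacy.gold import biluo_tags_from_offsets

def find_misaligned(nlp, examples):
    """Return the examples whose entity offsets don't align with token boundaries."""
    bad = []
    for text, annot in examples:
        doc = nlp.make_doc(text)
        tags = biluo_tags_from_offsets(doc, annot['entities'])
        if '-' in tags:
            bad.append((text, annot['entities'], tags))
    return bad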

adrianeboyd commented 4 years ago

Yes, that's why I wanted to add a warning, but I haven't found an easy, non-overwhelming way to incorporate it yet. Ines has simplified some of the warnings setup for v3, so it may be easier to incorporate there than in v2.

In spacy's JSON training format, if you provide "raw" text, you can still have misaligned tokens where the annotation is discarded because there's no way to map to spacy's tokenization. I think for NER annotation it only matters whether the start and end of the span are correct, since the tokenization in the middle doesn't affect the final span. For things like fine-grained POS tags, which are always tied directly to a token, the annotation for all the misaligned tokens is ignored.

If you don't provide a "raw" text then it trains from the gold tokenization and no annotation is discarded, but you get a better picture of the model's performance on real texts by including "raw", since you see how the actual tokenizer performance affects the model performance. The tokenization accuracy is included in the train CLI output and the model's meta.json.
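
If you end up building the training JSON yourself from the simple format, a rough (untested) sketch like this shows where the same alignment problem surfaces: char_span() returns None for spans that don't match token boundaries, so those entities would just be dropped.

import json
import spacy
from spacy.gold import docs_to_json

nlp = spacy.blank("en")  # or your model's tokenizer
TRAIN_DATA = []  # your list of (text, {"entities": [...]}) pairs

docs = []
for text, annot in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # same underlying issue as the '-' tags
            print("misaligned:", (start, end, label), repr(text[start:end]))
        else:
            spans.append(span)
    doc.ents = spans
    docs.append(doc)

with open("train.json", "w", encoding="utf8") as f:
    json.dump([docs_to_json(docs)], f)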

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.