CODAIT / Identifying-Incorrect-Labels-In-CoNLL-2003

Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.
Apache License 2.0

download_and_correct_corpus.py does not handle "Missing" errors that overlap with "Span" #34

Closed by frreiss 3 years ago

frreiss commented 3 years ago

In document 42 of the "dev" fold, the lines:

at IN I-PP O
Driefontein NNP I-NP I-ORG
Consolidated NNP I-NP I-ORG
and CC O I-ORG
Gold NNP I-NP I-ORG
Fields NNP I-NP I-ORG
' POS B-NP I-ORG
Kloof NNP I-NP I-ORG
Gold NNP I-NP I-ORG
Mining NNP I-NP I-ORG
Co NNP I-NP I-ORG
this DT B-NP O

have the following three corrections applied to them:

dev,42,"[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co'",ORG,Span,"[476, 500): 'Driefontein Consolidated'",ORG,
dev,42,,,Missing,"[505, 516): 'Gold Fields'",ORG,
dev,42,,,Missing,"[519, 539): 'Kloof Gold Mining Co'",ORG,

After these corrections, the lines should be tagged as follows:

at IN I-PP O
Driefontein NNP I-NP I-ORG
Consolidated NNP I-NP I-ORG
and CC O O
Gold NNP I-NP I-ORG
Fields NNP I-NP I-ORG
' POS B-NP O
Kloof NNP I-NP I-ORG
Gold NNP I-NP I-ORG
Mining NNP I-NP I-ORG
Co NNP I-NP I-ORG
this DT B-NP O
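One way to get the expected tags above is to apply the corrections in two passes: first clear every token covered by an old corpus span, then tag every token covered by a corrected span. The following is only a minimal sketch of that idea, not the logic in download_and_correct_corpus.py. The helper names (`parse_span`, `token_offsets`, `apply_corrections`) are hypothetical, and the starting offset 473 is an assumption derived from the span offsets in the corrections, which are consistent with tokens joined by single spaces.

```python
import re

# Matches the "[begin, end)" prefix of a span string such as "[505, 516): 'Gold Fields'".
SPAN_RE = re.compile(r"\[(\d+), (\d+)\)")


def parse_span(span_text):
    """Return the (begin, end) character offsets from a span string."""
    match = SPAN_RE.match(span_text)
    return int(match.group(1)), int(match.group(2))


def token_offsets(words, start):
    """Assign character offsets, ASSUMING tokens are joined by single spaces."""
    offsets = []
    pos = start
    for word in words:
        offsets.append((pos, pos + len(word)))
        pos += len(word) + 1
    return offsets


def retag(tags, offsets, span_text, new_tag):
    """Set new_tag on every token that lies entirely inside the span."""
    begin, end = parse_span(span_text)
    for i, (tok_begin, tok_end) in enumerate(offsets):
        if tok_begin >= begin and tok_end <= end:
            tags[i] = new_tag


def apply_corrections(offsets, tags, corrections):
    """Hypothetical two-pass application of (corpus_span, correct_span, ent_type)."""
    tags = list(tags)
    # Pass 1: erase the entity tags under each old corpus span.
    for corpus_span, _, _ in corrections:
        if corpus_span:
            retag(tags, offsets, corpus_span, "O")
    # Pass 2: tag each corrected span. This always writes "I-" tags and so
    # ignores the "B-" prefix needed for adjacent entities of the same type.
    for _, correct_span, ent_type in corrections:
        if correct_span:
            retag(tags, offsets, correct_span, "I-" + ent_type)
    return tags


words = ["at", "Driefontein", "Consolidated", "and", "Gold", "Fields",
         "'", "Kloof", "Gold", "Mining", "Co", "this"]
corpus_tags = ["O"] + ["I-ORG"] * 10 + ["O"]
corrections = [
    ("[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co'",
     "[476, 500): 'Driefontein Consolidated'", "ORG"),
    ("", "[505, 516): 'Gold Fields'", "ORG"),
    ("", "[519, 539): 'Kloof Gold Mining Co'", "ORG"),
]

offsets = token_offsets(words, start=473)  # 473 is an assumed offset for "at"
fixed_tags = apply_corrections(offsets, corpus_tags, corrections)
for word, tag in zip(words, fixed_tags):
    print(word, tag)
```

Because all clearing happens before any tagging, a "Missing" span that falls inside the old "Span" region is not wiped out by the order in which the corrections are listed.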

Instead, download_and_correct_corpus.py produces this output:

at IN I-PP O
Driefontein NNP I-NP O
Consolidated NNP I-NP O
and CC O O
Gold NNP I-NP O
Fields NNP I-NP O
' POS B-NP O
Kloof NNP I-NP I-O
Gold NNP I-NP I-O
Mining NNP I-NP I-O
Co NNP I-NP I-O
this DT B-NP O

The tokens that should be tagged I-ORG as a result of the two "Missing" type corrections are instead tagged "O".

xuhdev commented 3 years ago

@frreiss The issue seems to be that some "Missing" corrections put the span in the "corpus_span" column, while others put it in the "correct_span" column. The script currently uses the "corpus_span" column. Do you want me to use whichever of the two columns is populated? (This seems hacky, though.)
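For illustration, the "whichever one is present" option could look roughly like the sketch below. It parses the three correction rows quoted earlier in this issue. The column names in `FIELDS` are assumptions apart from "corpus_span" and "correct_span", and `missing_span` is a hypothetical helper, not a function from the script.

```python
import csv
import io

# Column names are ASSUMED for this sketch; only "corpus_span" and
# "correct_span" are named in this thread.
FIELDS = ["fold", "doc_offset", "corpus_span", "corpus_ent_type",
          "error_type", "correct_span", "correct_ent_type", "notes"]

# The three correction rows quoted in the issue description.
ROWS = """\
dev,42,"[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co'",ORG,Span,"[476, 500): 'Driefontein Consolidated'",ORG,
dev,42,,,Missing,"[505, 516): 'Gold Fields'",ORG,
dev,42,,,Missing,"[519, 539): 'Kloof Gold Mining Co'",ORG,
"""


def missing_span(row):
    """Prefer "correct_span"; fall back to "corpus_span" if it is empty."""
    return row["correct_span"] or row["corpus_span"]


rows = list(csv.DictReader(io.StringIO(ROWS), fieldnames=FIELDS))
missing = [missing_span(r) for r in rows if r["error_type"] == "Missing"]
print(missing)
```

The fallback keeps rows that only fill "corpus_span" working, at the cost of hiding the inconsistency in the data rather than fixing it.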

xuhdev commented 3 years ago

It seems that "correct_span" is overwhelmingly the column used. I'll change that.