CODAIT / Identifying-Incorrect-Labels-In-CoNLL-2003

Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.
Apache License 2.0

About 10 lines are tagged as "I-O" #17

Closed (xuhdev closed this issue 3 years ago)

xuhdev commented 4 years ago

"I-O" is not a valid tag. It should be "O". We corrected this by hand in our experiments, but we should fix this in the script.

eng.testa
11276:Kloof NNP I-NP I-O
11277:Gold NNP I-NP I-O
11278:Mining NNP I-NP I-O
11279:Co NNP I-NP I-O
42162:first JJ I-NP I-O
42163:division NN I-NP I-O
42217:first JJ I-NP I-O
42218:division NN I-NP I-O

eng.testb
12669:Zywiec NNP I-NP I-O
12670:Full NNP I-NP I-O
12671:Light NNP I-NP I-O

xuhdev commented 4 years ago

I think I see what's going on here. The cause is that these span corrections have the tag O. The warning messages say:

[476, 539): 'Driefontein Consolidated and Gold Fields' Kloof Gold Mining Co' has an invalid tag O
[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co' has an invalid tag O
[476, 539): 'Driefontein Consolidated and Gold Fields' Kloof Gold Mining Co' has an invalid tag O
[120, 128): 'division' has an invalid tag O
[114, 122): 'division' has an invalid tag O
[3224, 3230): 'Zywiec' has an invalid tag O

Am I right that span corrections shouldn't be applied to O tags?

BryanCutler commented 4 years ago

Yes, I-O is not valid. It should just be O. Is this an error in the original corpus?
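A minimal sketch of how the script could avoid emitting "I-O", assuming the tag is built by joining a prefix and an entity type. The helper names `iob_tag` and `normalize_line` are hypothetical and are not taken from the repository's code:

```python
def iob_tag(prefix: str, entity_type: str) -> str:
    """Build an IOB tag, collapsing the outside type to a bare "O".

    Hypothetical helper: the real script may build its tags differently.
    """
    if entity_type == "O":
        return "O"
    return f"{prefix}-{entity_type}"


def normalize_line(line: str) -> str:
    """Rewrite a malformed final column such as "I-O" or "B-O" to "O".

    Assumes whitespace-separated columns with the entity tag last,
    as in the eng.testa / eng.testb lines quoted above.
    """
    fields = line.split()
    if fields and fields[-1] in ("I-O", "B-O"):
        fields[-1] = "O"
    return " ".join(fields)


fixed = normalize_line("Kloof NNP I-NP I-O")
```

Guarding at tag-construction time (`iob_tag`) would prevent the invalid tag from being written at all, while `normalize_line` is only a post-hoc cleanup for files that were already generated.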

xuhdev commented 4 years ago

@BryanCutler The correction files include span errors that correct spans of O-tagged entities. I guess that's the issue.

xuhdev commented 4 years ago

In other words, as the warning messages indicate, my understanding is that the correction files themselves contain some errors.

BryanCutler commented 3 years ago

Take a look at the span [3224, 3230): 'Zywiec' in all_conll_corrections_combined.csv. One entry marks it as a span error whose correct span is 'Zywiec Full Light', while another entry marks it as "wrong" and then lists [3231, 3241) 'Full Light' as a span error with the same correct span, 'Zywiec Full Light'. The result is the same either way, but the two entries conflict. Changing either one fixes the error, so I'm trying to make them match in the original CSV file.
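One way to look for other conflicting entries of this kind is to group the correction rows by span and flag any span that carries more than one error type. This is only a sketch: the field names `begin`, `end`, and `error_type` are assumptions for illustration and may not match the actual columns of all_conll_corrections_combined.csv, and a real check would also need to key on the file and document each span belongs to.

```python
from collections import defaultdict


def find_conflicts(rows):
    """Return spans that appear with more than one distinct error type.

    `rows` is an iterable of dicts with hypothetical keys
    "begin", "end", and "error_type".
    """
    types_by_span = defaultdict(set)
    for row in rows:
        key = (row["begin"], row["end"])
        types_by_span[key].add(row["error_type"])
    return {
        span: sorted(types)
        for span, types in types_by_span.items()
        if len(types) > 1
    }


# Illustrative rows modeled on the 'Zywiec' case described above.
rows = [
    {"begin": 3224, "end": 3230, "error_type": "Span"},
    {"begin": 3224, "end": 3230, "error_type": "Wrong"},
    {"begin": 3231, "end": 3241, "error_type": "Span"},
]
conflicts = find_conflicts(rows)
```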