Closed: xuhdev closed this 3 years ago
I think I got what's going on here. The reason is that these span corrections have tag O. The warning message says:
[476, 539): 'Driefontein Consolidated and Gold Fields' Kloof Gold Mining Co' has an invalid tag O
[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co' has an invalid tag O
[476, 539): 'Driefontein Consolidated and Gold Fields' Kloof Gold Mining Co' has an invalid tag O
[120, 128): 'division' has an invalid tag O
[114, 122): 'division' has an invalid tag O
[3224, 3230): 'Zywiec' has an invalid tag O
Am I right that span shouldn't be applied to O tags?
Yes, I-O is not valid. It should just be O. Is this an error in the original corpus?
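To make the point about I-O concrete, here is a minimal sketch of a guard that never attaches an IOB prefix to O. The function names make_iob_tag and normalize_tag are made up for illustration and are not taken from the project's script.

```python
def make_iob_tag(prefix, entity_type):
    """Combine an IOB prefix ("B" or "I") with an entity type.

    The outside tag "O" never takes a prefix, so "I" + "O" yields "O".
    """
    if entity_type == "O":
        return "O"
    return f"{prefix}-{entity_type}"


def normalize_tag(tag):
    """Collapse malformed tags such as "I-O" or "B-O" to plain "O"."""
    return "O" if tag in ("I-O", "B-O") else tag


print(make_iob_tag("I", "O"))    # O
print(make_iob_tag("B", "PER"))  # B-PER
print(normalize_tag("I-O"))      # O
```

Handling the case where the tag is built is probably cleaner than patching tags afterwards, but normalize_tag shows the after-the-fact variant too.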
@BryanCutler The correction files include span errors that correct spans of O-tagged entities. I guess that's the issue.
In other words, as the warning messages indicate, the correction files themselves contain some errors, as far as I understand.
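A rough sketch of how such entries could be separated out before they are applied, assuming each correction is available as a record with begin, end, text and tag fields. Those field names are hypothetical, and the second sample record is invented purely for contrast; only the 'division' record mirrors the warning above.

```python
# Hypothetical record layout; the real correction files may use other field names.
corrections = [
    {"begin": 120, "end": 128, "text": "division", "tag": "O"},  # from the warning above
    {"begin": 0, "end": 5, "text": "Japan", "tag": "LOC"},       # invented example
]


def split_by_tag_validity(records):
    """Separate span corrections with a real entity tag from those tagged O."""
    valid, invalid = [], []
    for rec in records:
        (invalid if rec["tag"] == "O" else valid).append(rec)
    return valid, invalid


def format_warning(rec):
    """Render a record in the same shape as the warning lines quoted above."""
    return f"[{rec['begin']}, {rec['end']}): '{rec['text']}' has an invalid tag {rec['tag']}"


valid, invalid = split_by_tag_validity(corrections)
for rec in invalid:
    print(format_warning(rec))
```

This only flags the O-tagged span corrections. Whether to drop them or fix them in the csv is a separate decision.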
Take a look at the span [3224, 3230): 'Zywiec' in all_conll_corrections_combined.csv. One entry says it's a span error and should be Zywiec Full Light, while another entry says it's "wrong" and is followed by [3231, 3241) Full Light as a span error whose correct span is Zywiec Full Light. The result is the same, but the entries conflict. Changing either one fixes the error, so I'm trying to make them match in the original csv file.
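A small sketch of how conflicts like this could be found automatically, assuming the csv rows have been loaded as dicts. The column names (begin, end, text, error_type, corrected_span) are assumptions and not the file's actual header. The three rows just restate the Zywiec situation described above.

```python
from collections import defaultdict

# Rows restating the case above; column names are hypothetical.
rows = [
    {"begin": 3224, "end": 3230, "text": "Zywiec",
     "error_type": "span", "corrected_span": "Zywiec Full Light"},
    {"begin": 3224, "end": 3230, "text": "Zywiec",
     "error_type": "wrong", "corrected_span": ""},
    {"begin": 3231, "end": 3241, "text": "Full Light",
     "error_type": "span", "corrected_span": "Zywiec Full Light"},
]


def conflicting_entries(rows):
    """Return spans that are listed more than once with different error types."""
    kinds_by_span = defaultdict(set)
    for row in rows:
        kinds_by_span[(row["begin"], row["end"])].add(row["error_type"])
    return {span: kinds for span, kinds in kinds_by_span.items() if len(kinds) > 1}


for (begin, end), kinds in conflicting_entries(rows).items():
    print(f"[{begin}, {end}) has conflicting error types: {sorted(kinds)}")
```

Here only [3224, 3230) is reported, because it is marked both as a span error and as "wrong". The [3231, 3241) row is consistent on its own.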
"I-O" is not a valid tag. It should be "O". We corrected this by hand in our experiments, but we should fix this in the script.