CODAIT / Identifying-Incorrect-Labels-In-CoNLL-2003

Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.
Apache License 2.0

Cleanup download and correct script #16

Closed BryanCutler closed 3 years ago

BryanCutler commented 3 years ago

Cleans up the script that downloads and corrects the corpus. It no longer writes intermediate label-only corrected files, and the content of the other scripts is moved into it and called directly so that those scripts can be removed.

BryanCutler commented 3 years ago

@xuhdev could you please review? I verified the output is the same by running before/after and comparing output files.
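For readers who want to repeat this kind of before/after check, the following is a minimal sketch using Python's standard `filecmp` module. The helper name `differing_files` and the directory names in the usage comment are assumptions made for illustration; the thread does not say how the comparison was actually done. The file names are the ones that appear in the log output in this thread.

```python
import filecmp


def differing_files(before_dir, after_dir, names):
    """Return the names that differ between the two directories.

    A name is reported if the file contents differ, or if the file
    could not be compared (for example, it is missing on one side).
    """
    # shallow=False forces a content comparison instead of relying
    # only on os.stat() metadata such as size and modification time.
    _match, mismatch, errors = filecmp.cmpfiles(
        before_dir, after_dir, names, shallow=False
    )
    return sorted(mismatch + errors)


# Example usage (directory names are hypothetical):
#   names = ["eng.train", "eng.testa", "eng.testb"]
#   print(differing_files("corrected_corpus_before", "corrected_corpus", names))
```

An empty list means every listed file is byte-for-byte identical in both directories.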

BryanCutler commented 3 years ago

TODO: update main readme after this

BryanCutler commented 3 years ago

Also, I'm seeing some errors that I don't remember from before:

INFO:root:Getting CoNLL-2003 Corpus..
INFO:root:CoNLL-2003 Corpus downloaded to: /home/bryan/git/stc/Identifying-Incorrect-Labels-In-CoNLL-2003/original_corpus
INFO:root:Correcting labels for fold 'train'
[WARNING] Could not find [224, 257): 'OCASEK GOVERNMENT OFFICE BUILDING'
[WARNING] Correct_ent_type for line 909 is empty. Skipping...
[WARNING] Could not find [21, 24): 'T&N'
INFO:root:Correcting sentence boundaries for fold 'train'
INFO:root:Corrected corpus fold train to file: 'corrected_corpus/eng.train'
INFO:root:Correcting labels for fold 'dev'
[476, 539): 'Driefontein Consolidated and Gold Fields' Kloof Gold Mining Co' has an invalid tag O
[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co' has an invalid tag O
[476, 539): 'Driefontein Consolidated and Gold Fields' Kloof Gold Mining Co' has an invalid tag O
[120, 128): 'division' has an invalid tag O
[114, 122): 'division' has an invalid tag O
INFO:root:Correcting sentence boundaries for fold 'dev'
INFO:root:Corrected corpus fold dev to file: 'corrected_corpus/eng.testa'
INFO:root:Correcting labels for fold 'test'
[3224, 3230): 'Zywiec' has an invalid tag O
Skip span error for '(Iowa-S) Minn'. Please correct it by hand.
Skip span error for '(Iowa-S) Minn'. Please correct it by hand.
INFO:root:Correcting sentence boundaries for fold 'test'
INFO:root:Corrected corpus fold test to file: 'corrected_corpus/eng.testb'

Are you seeing these too?

xuhdev commented 3 years ago

@BryanCutler Yes, I'm also seeing these. The invalid tags only appear after the sentence boundary error correction was incorporated.

xuhdev commented 3 years ago

@BryanCutler The error messages you see appear to be related to #17, since the relevant lines overlap significantly.

BryanCutler commented 3 years ago

@xuhdev I added a check for all required files/dirs, so if the user does something strange it should raise an error that tells them to run under project root dir. Let me know if this looks ok now.
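A check of this kind could look roughly like the sketch below. It is an illustration only, not the code from this pull request: the function name `check_project_root` and the entries in `REQUIRED_PATHS` are hypothetical, since the thread does not list which files and directories the script requires.

```python
from pathlib import Path

# Hypothetical list: the paths the real script requires are not shown
# in this thread.
REQUIRED_PATHS = ("scripts", "corrected_labels")


def check_project_root(root=".", required=REQUIRED_PATHS):
    """Raise FileNotFoundError if any required path is missing under root."""
    root = Path(root)
    missing = [name for name in required if not (root / name).exists()]
    if missing:
        raise FileNotFoundError(
            f"Missing required files/dirs: {', '.join(missing)}. "
            "Please run this script from the project root directory."
        )
```

Calling the check once at startup, before any download begins, lets the script fail fast with an actionable message instead of failing partway through the correction steps.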

BryanCutler commented 3 years ago

Going to go ahead and merge this; we can add more options as a follow-up if needed.