Closed BryanCutler closed 3 years ago
@xuhdev could you please review? I verified the output is the same by running before/after and comparing output files.
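A before/after comparison like the one described above could be scripted roughly as follows. This is only a sketch: the helper name `differing_files` and the two directory arguments are hypothetical, and the file names in the usage note are taken from the log output in this thread.

```python
import filecmp
from pathlib import Path
from typing import List


def differing_files(before_dir: Path, after_dir: Path, names: List[str]) -> List[str]:
    """Return the names whose contents differ between the two directories."""
    return [
        name
        for name in names
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting file size and modification time alone.
        if not filecmp.cmp(before_dir / name, after_dir / name, shallow=False)
    ]
```

Assuming the pre-change and post-change runs were saved to two separate directories, calling `differing_files(before, after, ["eng.train", "eng.testa", "eng.testb"])` should return an empty list if the refactoring left the output unchanged.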
TODO: update main readme after this
Also, I'm seeing some errors that I don't remember seeing before:
INFO:root:Getting CoNLL-2003 Corpus..
INFO:root:CoNLL-2003 Corpus downloaded to: /home/bryan/git/stc/Identifying-Incorrect-Labels-In-CoNLL-2003/original_corpus
INFO:root:Correcting labels for fold 'train'
[WARNING] Could not find [224, 257): 'OCASEK GOVERNMENT OFFICE BUILDING'
[WARNING] Correct_ent_type for line 909 is empty. Skipping...
[WARNING] Could not find [21, 24): 'T&N'
INFO:root:Correcting sentence boundaries for fold 'train'
INFO:root:Corrected corpus fold train to file: 'corrected_corpus/eng.train'
INFO:root:Correcting labels for fold 'dev'
[476, 539): 'Driefontein Consolidated and Gold Fields' Kloof Gold Mining Co' has an invalid tag O
[476, 539): 'Driefontein Consolidated and Gold Fields ' Kloof Gold Mining Co' has an invalid tag O
[476, 539): 'Driefontein Consolidated and Gold Fields' Kloof Gold Mining Co' has an invalid tag O
[120, 128): 'division' has an invalid tag O
[114, 122): 'division' has an invalid tag O
INFO:root:Correcting sentence boundaries for fold 'dev'
INFO:root:Corrected corpus fold dev to file: 'corrected_corpus/eng.testa'
INFO:root:Correcting labels for fold 'test'
[3224, 3230): 'Zywiec' has an invalid tag O
Skip span error for '(Iowa-S) Minn'. Please correct it by hand.
Skip span error for '(Iowa-S) Minn'. Please correct it by hand.
INFO:root:Correcting sentence boundaries for fold 'test'
INFO:root:Corrected corpus fold test to file: 'corrected_corpus/eng.testb'
Are you seeing these too?
@BryanCutler Yes, I'm also seeing these. The invalid tags only appear after the sentence boundary error correction was incorporated.
@BryanCutler The error messages you see appear to be related to #17, since the affected lines overlap significantly.
@xuhdev I added a check for all required files/dirs, so if the user does something strange it should raise an error that tells them to run under project root dir. Let me know if this looks ok now.
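A check of this kind might look roughly like the sketch below. The function names and the `required` argument are hypothetical and are not taken from the actual script; the real list of required files and directories may differ.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_paths(root: Path, required: Iterable[str]) -> List[str]:
    """Return the required relative paths that do not exist under root."""
    return [rel for rel in required if not (root / rel).exists()]


def check_project_root(root: Path, required: Iterable[str]) -> None:
    """Raise an error with a hint if any required file or directory is missing."""
    missing = find_missing_paths(root, required)
    if missing:
        raise FileNotFoundError(
            "Missing required files/dirs: " + ", ".join(missing)
            + ". Please run this script from the project root directory."
        )
```

Calling `check_project_root(Path.cwd(), [...])` at the start of the script would fail early with a clear message, rather than partway through the download or correction steps.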
I'm going to go ahead and merge this; we can add more options as a follow-up if needed.
Cleans up the script that downloads and corrects the corpus: removes the intermediate label-only corrected files, and moves the content of the other scripts into code that is called directly, so those scripts can be removed.