CODAIT / Identifying-Incorrect-Labels-In-CoNLL-2003

Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.
Apache License 2.0

Adjust audited files so that we can generate all_conll_corrections_combined.csv automatically #36

Closed frreiss closed 3 years ago

frreiss commented 3 years ago

This PR adjusts the CSV files under corrected_labels/human_labels_audited such that the Label_Stats.ipynb notebook will generate a correct version of corrected_labels/all_conll_corrections_combined.csv when run.

Most of the edits involved adjusting corrections in the input files to eliminate conflicting instructions for fixing errors.
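As an illustration of what "conflicting instructions" means here, the following is a minimal sketch, not code from this repository. It assumes each correction row can be reduced to a span key plus a proposed correction; the field names `fold`, `doc`, `span`, and `correction` are hypothetical and do not necessarily match the column names in the audited CSV files.

```python
from collections import defaultdict


def find_conflicts(rows):
    """Return the span keys that carry more than one distinct correction.

    Each row is a dict with the hypothetical keys
    'fold', 'doc', 'span', and 'correction'.
    """
    seen = defaultdict(set)
    for row in rows:
        key = (row["fold"], row["doc"], row["span"])
        seen[key].add(row["correction"])
    # A span is in conflict when the audited files disagree about its fix.
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


# Made-up example rows, for illustration only.
rows = [
    {"fold": "test", "doc": 3, "span": "[10, 17)", "correction": "ORG"},
    {"fold": "test", "doc": 3, "span": "[10, 17)", "correction": "LOC"},
    {"fold": "test", "doc": 5, "span": "[0, 6)", "correction": "PER"},
]
conflicts = find_conflicts(rows)
print(conflicts)
```

In this sketch, only the span that received two different corrections is reported; spans with a single, consistent correction are ignored.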

The next most common type of edit involved fixing typos in spans -- extra spaces, incorrect offsets, and so on.
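A check along the following lines could flag such span typos. This is a rough sketch under the assumption that each correction records character offsets into the document text plus the covered text; the function name `check_span` and its parameters are hypothetical, not part of this repository.

```python
def check_span(doc_text, begin, end, span_text):
    """Return a list of problems found for one recorded span.

    begin/end are assumed to be character offsets (end exclusive)
    into doc_text; span_text is the text the span claims to cover.
    """
    problems = []
    if span_text != span_text.strip():
        problems.append("extra whitespace")
    if not (0 <= begin < end <= len(doc_text)):
        problems.append("offsets out of range")
    elif doc_text[begin:end] != span_text.strip():
        problems.append("offsets do not match text")
    return problems


# Made-up example sentence, for illustration only.
doc = "EU rejects German call to boycott British lamb ."
print(check_span(doc, 11, 17, "German"))
print(check_span(doc, 11, 17, "German "))
print(check_span(doc, 10, 16, "German"))
```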

The least common type of edit involved fixing erroneous corrections. Most of these corrections had already been applied in previous manual edits of all_conll_corrections_combined.csv.

Note to reviewers: I recommend that you start by looking at the changes to corrected_labels/all_conll_corrections_combined.csv.

I've also added some .gitignore files to keep the corpus from being inadvertently added to Git.

frreiss commented 3 years ago

> I looked through the edits and what I saw seemed good to me, but it's hard to tell since there are a lot of changes. Is there a good way we can verify the corrections are correct? Maybe doing a diff on the corrected corpus before/after the change?

Good idea, I'll PM you the corrected corpus after the change. Don't want to attach it to a GitHub issue for copyright reasons.
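For reference, a before/after comparison of this kind can be sketched with Python's standard `difflib` module. This is only an illustration, assuming both versions of the corrected corpus are available as lists of lines; the function name `changed_lines` and the sample lines are made up and are not taken from the corpus or from this repository's scripts.

```python
import difflib


def changed_lines(before, after):
    """Return (1-based line number, old lines, new lines) for each difference."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((i1 + 1, before[i1:i2], after[j1:j2]))
    return changes


# Made-up lines in a CoNLL-like column layout, for illustration only.
before = [
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "German JJ B-NP B-MISC",
]
after = [
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "German JJ B-NP B-LOC",
]
changes = changed_lines(before, after)
for line_no, old, new in changes:
    print(line_no, old, "->", new)
```

Working from opcodes rather than a textual unified diff avoids confusing diff markers with corpus lines that themselves begin with `-` or `+`.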

frreiss commented 3 years ago

Pushed some additional changes:

stefan-it commented 3 years ago

Hi @frreiss,

thanks for these corrections!

I would like to ask whether the version in this PR can already be used for further experiments. We would like to use the corrected dataset for additional experiments for our paper :hugs:

frreiss commented 3 years ago

Hi @stefan-it, sorry for the delay in getting back to you. I've just come back from vacation and am making one additional pass through the diffs on this branch before merging this PR. After that I'll tag a new release.

frreiss commented 3 years ago

Finished going through the line-level diffs. I made some additional fixes to the CSV files. Most fixes involved workarounds for how scripts/download_and_correct_corpus.py deals with multiple changes that affect the same entity mention. Merging this PR.

frreiss commented 3 years ago

@stefan-it I've posted a new release, version 0.2, with the corrections from this PR.

stefan-it commented 3 years ago

Thanks for releasing the new version :heart: I will start working with the new release now :+1: