Closed frreiss closed 3 years ago
I looked through the edits and what I saw seemed good to me, but it's hard to tell since there are a lot of changes. Is there a good way we can verify the corrections are correct? Maybe doing a diff on the corrected corpus before/after the change?
Good idea, I'll PM you the corrected corpus after the change. Don't want to attach it to a GitHub issue for copyright reasons.
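A before/after comparison like the one suggested above could be sketched as follows, assuming both versions of the corrected corpus have been written out as local text files. The helper names `diff_lines` and `diff_corpus` are hypothetical and are not part of this repository; the sketch uses only Python's standard `difflib` module.

```python
import difflib
from pathlib import Path


def diff_lines(before, after, before_name="before", after_name="after"):
    """Return the unified-diff lines for two lists of text lines."""
    return list(
        difflib.unified_diff(
            before, after, fromfile=before_name, tofile=after_name, lineterm=""
        )
    )


def diff_corpus(before_path, after_path):
    """Diff two corpus files line by line (hypothetical helper; the caller supplies the paths)."""
    before = Path(before_path).read_text(encoding="utf-8").splitlines()
    after = Path(after_path).read_text(encoding="utf-8").splitlines()
    return diff_lines(before, after, str(before_path), str(after_path))
```

An empty result means the two files are identical. Otherwise each changed line appears once with a `-` prefix (before) and once with a `+` prefix (after), which keeps a large batch of label corrections reviewable line by line.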
Pushed some additional changes:
Hi @frreiss,
thanks for these corrections!
I would like to ask if the version of this PR can be used e.g. for further experiments. We would like to use the corrected dataset for more experiments for our paper :hugs:
Hi @stefan-it, sorry for the delay in getting back to you. I've just come back from vacation and am making one additional pass through the diffs on this branch before merging this PR. After that I'll tag a new release.
Finished going through the line-level diffs. I made some additional fixes to the CSV files. Most fixes involved workarounds for how scripts/download_and_correct_corpus.py deals with multiple changes that affect the same entity mention. Merging this PR.
@stefan-it I've posted a new release, version 0.2, with the corrections from this PR.
Thanks for releasing that new version :heart: I will start to work with the new release now :+1:
This PR adjusts the CSV files under corrected_labels/human_labels_audited such that the Label_Stats.ipynb notebook will generate a correct version of corrected_labels/all_conll_corrections_combined.csv when run.
Most of the edits involved adjusting corrections in the input files to eliminate conflicting instructions for fixing errors.
The next most common type of edit involved fixing typos in spans -- extra spaces, incorrect offsets, and so on.
The least common type of edit involved fixing erroneous corrections. Most of these corrections had already been applied in previous manual edits of all_conll_corrections_combined.csv.
Note to reviewers: I recommend that you start by looking at the changes to corrected_labels/all_conll_corrections_combined.csv.
I've also added some .gitignore files to keep the corpus from being inadvertently added to Git.
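The "conflicting instructions" mentioned above (more than one correction aimed at the same entity mention) could be detected with a check like the sketch below. This is not how scripts/download_and_correct_corpus.py or the notebook works. The function name `find_conflicts` and the row keys `doc`, `begin`, `end`, and `correction` are hypothetical stand-ins, since the actual CSV columns are not described here.

```python
from collections import defaultdict


def find_conflicts(rows):
    """Return spans that carry more than one distinct correction.

    Each row is a dict with hypothetical keys "doc", "begin", "end",
    and "correction"; the real CSV column names may differ.
    """
    by_span = defaultdict(set)
    for row in rows:
        # Identify an entity mention by its document and character offsets.
        key = (row["doc"], int(row["begin"]), int(row["end"]))
        by_span[key].add(row["correction"])
    # Keep only the spans whose corrections disagree with each other.
    return {span: sorted(found) for span, found in by_span.items() if len(found) > 1}
```

Rows in this shape could come from `csv.DictReader` applied to each audited file. Repeated identical corrections for a span are not reported, only spans where the requested corrections differ.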