CODAIT / Identifying-Incorrect-Labels-In-CoNLL-2003

Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.
Apache License 2.0

Remove leading column from merged corrections file. #31

Closed frreiss closed 4 years ago

frreiss commented 4 years ago

The file all_conll_corrections_combined.csv currently contains a leading column with Pandas index values. The values in this leading column will all change if there is a slight difference in the upstream data or if Label_Stats.ipynb is run on a different version of Python.

This PR removes the leading column to make future diffs smaller.
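For illustration, a one-off cleanup of this kind could look roughly like the sketch below. It uses only the standard library and hypothetical sample data; the column names `col_a` and `col_b` and the helper `drop_leading_column` are assumptions, not taken from the actual file or notebook.

```python
import csv
import io


def drop_leading_column(csv_text: str) -> str:
    """Return csv_text with the first field of every row removed."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # row[0] is the stray Pandas index value (empty in the header row).
        writer.writerow(row[1:])
    return out.getvalue()


# Hypothetical sample: the unnamed first column holds Pandas index values.
sample = ",col_a,col_b\n17,1,x\n42,2,y\n"
cleaned = drop_leading_column(sample)
print(cleaned)
```

Because the index values no longer appear in the file, a change in row numbering upstream would no longer touch every line of the diff.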

frreiss commented 4 years ago

I will push another change to Label_Stats.ipynb that adds index=False to the call to to_csv, so that the index column doesn't get added when someone reruns the script. There are other problems with the script that I am fixing now.
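As a minimal sketch of the effect of `index=False`, assuming a small hypothetical DataFrame rather than the real data produced by Label_Stats.ipynb:

```python
import io

import pandas as pd

# Hypothetical stand-in for the combined corrections table.
df = pd.DataFrame({"col_a": [1, 2], "col_b": ["x", "y"]}, index=[17, 42])

# Default behaviour: the index is written as an unnamed leading column.
with_index = io.StringIO()
df.to_csv(with_index)

# With index=False the leading column is omitted.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

print(with_index.getvalue())
print(without_index.getvalue())
```

The first output starts with an empty header field followed by the index values 17 and 42; the second contains only the two data columns.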

frreiss commented 4 years ago

It turns out that the follow-on fixes are more involved than I expected. Some of the labels in the audited CSV files conflict with each other, and the code that produces the four Boolean columns in all_conll_corrections_combined.csv seems to be producing unstable results. I'm going to redo this PR with those issues in mind.

BryanCutler commented 4 years ago

FYI @frreiss, I patched all_conll_corrections_combined.csv in #26 and opened #28 to fix the root cause of those errors in the label files and scripts. My thinking was that the patches get us to the first release and keep us aligned with the results from the paper. The root-cause fixes are more involved and might change all_conll_corrections_combined.csv somewhat, so they can go into a later second release. Does it make sense to do the same here?

frreiss commented 4 years ago

@BryanCutler Yes, I think we should post a release of this project with just the current set of minimal corrections to all_conll_corrections_combined.csv (including the changes I just pushed that fix the "S Minn" problem).

Then we can put out another release that addresses the issues with the CSV files under human_labels_audited. I have already corrected several dozen data entry errors in my local copy, and I'll cross-reference them with the edits you mentioned in #26 and #28.