Closed frreiss closed 4 years ago
I will push another change to Label_Stats.ipynb that adds index=False to the call to to_csv, so that the index column doesn't get added when someone reruns the script. There are other problems with the script that I am fixing now.
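For reference, a minimal sketch of what index=False changes in the output. The DataFrame and its column names here are made up for illustration and are not the real contents of the corrections file:

```python
import io

import pandas as pd

# Hypothetical stand-in table; the column names and values are illustrative only.
df = pd.DataFrame(
    {"token": ["foo", "bar"], "count": [3, 7]},
    index=[10, 42],  # arbitrary index values, like those left over after filtering
)


def to_csv_text(frame, **kwargs):
    """Write a DataFrame to an in-memory buffer and return the CSV text."""
    buf = io.StringIO()
    frame.to_csv(buf, **kwargs)
    return buf.getvalue()


# Default behavior: the index is written as an unnamed leading column.
with_index = to_csv_text(df)

# With index=False, only the data columns are written.
without_index = to_csv_text(df, index=False)

print(with_index.splitlines()[0])     # header starts with an empty field
print(without_index.splitlines()[0])  # header contains only the data columns
```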
It turns out that the follow-on fixes are more involved than I expected. There are some conflicts between some of the labels in the audited CSV files, and the code that produces the four Boolean columns in all_conll_corrections_combined.csv seems to be producing unstable results. I'm going to redo this PR with those issues in mind.
FYI @frreiss, I patched all_conll_corrections_combined.csv in #26 and made #28 to fix the root cause of those errors in the label files and scripts. I figured the patches will get us to the first release and align with the results from the paper. The root-cause fixes are more involved and might change all_conll_corrections_combined.csv somewhat, so they can wait for a later second release. Does that make sense to do here as well?
@BryanCutler yes, I think we should post a release of this project with just the current set of minimal corrections to all_conll_corrections_combined.csv (including the changes I just pushed that fix the "S Minn" problem).
Then we can put out another release that addresses the issues with the CSV files under human_labels_audited. I have already corrected several dozen data entry errors on my local copy. I'll cross-reference with the edits you mentioned in #26 and #28.
The file all_conll_corrections_combined.csv currently contains a leading column with Pandas index values. The values in this leading column will all change if there is a slight difference in the upstream data or if Label_Stats.ipynb is run on a different version of Python. This PR removes the leading column to make future diffs smaller.
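As a rough sketch of the cleanup, assuming the file was originally written with the to_csv defaults, the leading column can be dropped by reading it back as the index and rewriting without it. The CSV text and column names below are invented for illustration and do not reflect the real file:

```python
import io

import pandas as pd

# Illustrative CSV text with an unnamed leading index column (not the real data).
old_csv = ",token,is_error\n0,foo,True\n1,bar,False\n"

# index_col=0 tells pandas to treat the first column as the index, not as data.
df = pd.read_csv(io.StringIO(old_csv), index_col=0)

# Rewrite without the index so the leading column disappears from the file.
buf = io.StringIO()
df.to_csv(buf, index=False)
new_csv = buf.getvalue()

print(new_csv)
```

With the index gone, a change in row numbering upstream no longer touches every line of the file, so a diff shows only the rows whose data actually changed.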