CODAIT / Identifying-Incorrect-Labels-In-CoNLL-2003

Research into identifying and correcting incorrect labels in the CoNLL-2003 corpus.
Apache License 2.0

Incorporate sentence boundary error corrections into the integration script #11

Closed xuhdev closed 4 years ago

xuhdev commented 4 years ago

I'm keeping scripts/apply_sentence_correction.ipynb for now because the submission datasets still rely on it. I verified that the changes reproduce the same dataset files that were generated manually before.

Fix #6

BryanCutler commented 4 years ago

I mean change the log message. Right now it's exactly the same as the one in the label corrections.

On Tue, Oct 20, 2020, 4:41 PM Hong Xu notifications@github.com wrote:

@xuhdev commented on this pull request.

In scripts/download_corpus_and_correct_labels.py https://github.com/CODAIT/Identifying-Incorrect-Labels-In-CoNLL-2003/pull/11#discussion_r508902589 :

@@ -63,6 +64,33 @@ def apply_label_corrections(data_set_info, csv_file, target_dir=None, corpus_fol
     :param data_set_info: Dictionary containing a mapping from fold name to file name for each of the three folds (train, test, dev) of the corpus.
     :param csv_file: CSV file containing the label corrections
+    :param target_dir: (optional) Target directory for the corrected corpus or
+                       None for default of "corrected_corpus/label_only".
+    :param corpus_fold: (optional) Apply corrections to a specific fold only, or None for
+                        the entire corpus.
+    :return:
+    """
+    target_dir = target_dir or os.path.join("corrected_corpus", "label_only")
+    fold_n_files = data_set_info.items() if corpus_fold is None \
+        else [(corpus_fold, data_set_info[corpus_fold])]
+    new_data_set_info = dict()
+    for fold, fold_file in fold_n_files:
+        target_file = os.path.join(target_dir, os.path.split(fold_file)[-1])
+        logging.info("Processing fold '{}' to file: '{}'".format(fold, target_file))

What do you mean? The label correction function already comes before the sentence boundary correction function.


xuhdev commented 4 years ago

@BryanCutler Thanks for the review. I've pushed fixes for all your comments.
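
For readers following the log-message point from the review: one way to keep the two correction steps distinguishable in the logs is to name the step in the message. The snippet below is only a sketch under assumptions, not the code that was merged. The helper names `fold_targets` and `describe_step`, the message wording, and the example paths are all hypothetical; only the fold-selection expression and the use of `logging.info` mirror the quoted diff.

```python
import logging
import os


def fold_targets(data_set_info, target_dir, corpus_fold=None):
    """Return (fold, target_file) pairs for one fold, or for all folds if corpus_fold is None.

    Hypothetical helper; the selection logic follows the quoted diff.
    """
    fold_n_files = data_set_info.items() if corpus_fold is None \
        else [(corpus_fold, data_set_info[corpus_fold])]
    return [
        (fold, os.path.join(target_dir, os.path.split(fold_file)[-1]))
        for fold, fold_file in fold_n_files
    ]


def describe_step(step, fold, target_file):
    """Build and log a message that names the correction step (hypothetical wording)."""
    message = "Applying {} corrections to fold '{}', writing file: '{}'".format(
        step, fold, target_file)
    logging.info(message)
    return message


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Example mapping with made-up paths, for illustration only.
    info = {"train": "original_corpus/eng.train", "dev": "original_corpus/eng.testa"}
    for fold, target in fold_targets(info, "corrected_corpus"):
        describe_step("label", fold, target)
        describe_step("sentence boundary", fold, target)
```

With the step name in the message, a log line from the sentence boundary pass can no longer be mistaken for one from the label pass, even when both write the same fold.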