This PR adds functions to analyze the existence of duplications in a dataset.
Notably, this does not attempt to remove such duplicates, since the exact procedure is ambiguous (see #34).
In summary, there are the following functions and additions:
Addition of a cutoff_length parameter to the general Analyzer class, giving the option to limit the console output. This is necessary since some of the duplication detection works by printing out affected samples (given that datasets need not have a unique identifier for each sample, this is the only way to communicate duplicates effectively without converting to dataframes internally).
find_identity_samples, which will print samples where reference text and summary are exactly the same. Currently, this overlaps with the edge case of is_either_text_empty when both texts are empty. However, this is not a problem in practice, since both cases should be ignored anyway.
detect_leakage, which will scan for duplicate references (or summaries) across different dataset splits. No particular split is required, since some datasets only contain certain splits (e.g., in shared tasks), but sensible results will only be obtained by passing two or more splits (e.g., train and test, or validation and test).
detect_duplicates, which works as an extension of detect_leakage, but also prints out intra-split duplicate references/summaries, i.e., two samples in the train set having the same text. It internally calls detect_leakage to print inter-split duplicates.
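
For illustration, the checks described above can be sketched in a few standalone functions. This is a minimal sketch and not the code in this PR: it assumes samples are plain (reference, summary) tuples and splits are a dict of lists, the helper names truncate, find_leakage and find_intra_split_duplicates are hypothetical, and the functions return their findings instead of printing them as the Analyzer methods do.

```python
from collections import defaultdict


def truncate(text, cutoff_length=None):
    """Shorten text for console output; None means no limit (hypothetical helper)."""
    if cutoff_length is None or len(text) <= cutoff_length:
        return text
    return text[:cutoff_length] + "..."


def find_identity_samples(samples):
    """Return indices of samples whose reference equals the summary."""
    return [i for i, (reference, summary) in enumerate(samples) if reference == summary]


def find_leakage(splits, field=0):
    """Map each text found in two or more splits to the names of those splits.

    field=0 compares references, field=1 compares summaries.
    """
    seen_in = defaultdict(set)
    for split_name, samples in splits.items():
        for sample in samples:
            seen_in[sample[field]].add(split_name)
    return {text: sorted(names) for text, names in seen_in.items() if len(names) > 1}


def find_intra_split_duplicates(samples, field=0):
    """Map each text occurring more than once in one split to its sample indices."""
    positions = defaultdict(list)
    for i, sample in enumerate(samples):
        positions[sample[field]].append(i)
    return {text: idx for text, idx in positions.items() if len(idx) > 1}


if __name__ == "__main__":
    # Toy data, made up for this example only.
    splits = {
        "train": [
            ("a long article", "short"),
            ("same text", "same text"),
            ("a long article", "other"),
        ],
        "test": [("a long article", "short"), ("unseen", "x")],
    }
    print("identity samples in train:", find_identity_samples(splits["train"]))
    for text, names in find_leakage(splits).items():
        print("leaked reference:", truncate(text, cutoff_length=6), "in", names)
    for text, idx in find_intra_split_duplicates(splits["train"]).items():
        print("duplicate reference in train:", truncate(text, cutoff_length=6), "at", idx)
```

Grouping by the text itself avoids any need for a per-sample identifier, which is the same constraint that leads the functions in this PR to print the affected samples directly.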